ABSTRACT


Methods are given for the analysis of FORTRAN DO loops for
parallel execution on asynchronous and synchronous multiprocessors.
Limitations of the analytic approach are discussed. Application of
the methods to the ILLIAC-IV is described. A simulation process
for deriving concurrency is given in two examples for which the
analytic methods are inapplicable.
INTRODUCTION


Any program using a significant amount of computer time spends
most of that time executing one or more loops. For a large class of programs,
these loops can be represented as FORTRAN DO loops. We will consider
methods of executing these loops on a multiprocessor computer, in which
different processors independently execute different iterations of the loop
at the same time.

This approach was inspired by the ILLIAC-IV, since it is the only
type of parallel computation which that computer can perform [1]. However,
even for a computer with independent processors, it is inherently more
efficient than the usual approach of having the processors work together on
a single iteration of the loop. This is because it requires much less
communication among the individual processors.

The methods presented are, of course, independent of the syntax of
FORTRAN. The basic feature of the FORTRAN DO loop which is used is that
the range of values assumed by the index variable is known upon entry to
the loop. Thus, most but not all ALGOL FOR loops can be handled.

The analysis is performed from the standpoint of a compiler for a
multiprocessor computer. Two types of computers are considered: those
having asynchronous processors, and those like the ILLIAC-IV with
completely synchronous processors. A number of restrictions are made on the
loops just to simplify the exposition. In Chapter 3, we will discuss the
actual limitations of the techniques. Chapter 4 describes a practical
example of their use.
This approach to the parallel execution of loops appears to be very
effective for a large class of programs. It has significant implications for
the design of future computers and their compilers.

Chapter 5 presents, in examples, a process of loop control history
simulation which can be used to display potential concurrent execution for
loops which do not satisfy the requirements for the analytic methods.
I. THE GIVEN LOOP


We will consider DO loops of the following form:

(1)       DO α I^1 = l^1, u^1
             .
             .
          DO α I^n = l^n, u^n
             loop body
     α    CONTINUE

where the $l^j$ and $u^j$ are positive integers,* and the loop body has no I/O
statements, no subroutine or function calls which can modify data, and
no transfer of control to any statement outside the loop. The extension to
more general loops will be discussed later.

Let $Z$ denote the set of all integers, and let $Z^n$ denote the set of
n-tuples of integers. For completeness, define $Z^0 = \{0\}$.

The index set $\mathcal{I}$ of the loop (1) is defined to be the subset
$\{(i^1, \ldots, i^n) : l^j \le i^j \le u^j\}$ of $Z^n$. Thus, for the loop

          DO 7 I^1 = 1, 10
          DO 7 I^2 = 1, 20

we have

      $\mathcal{I} = \{(x, y) : 1 \le x \le 10,\ 1 \le y \le 20\}$.

*The use of superscripts and subscripts is in accord with the usual
notation of tensor algebra.

An execution of the loop body for an element $(p^1, \ldots, p^n)$ of $\mathcal{I}$
is the process of setting $I^1 = p^1, \ldots, I^n = p^n$ and then executing the loop
body in the usual fashion, stopping when statement α is reached.
Executing the entire loop (1) then involves the execution of the loop body
for each element of $\mathcal{I}$, in the order specified by the DO statements.

This suggests that we order the elements of $Z^n$ lexicographically
in the usual manner, with (2, 9, 13) < (3, -1, 10) < (3, 0, 0). Then for
any elements P and Q of $\mathcal{I}$, the loop body is executed for P before it is
executed for Q if and only if P < Q. Thus, the relation < on $Z^n$ gives the
appropriate temporal ordering of $\mathcal{I}$. In the preceding example, the loop
body is executed for (2, 11) before it is executed for (3, 5), since
(2, 11) < (3, 5).

Define addition and subtraction of elements of $Z^n$ by coordinate-wise
addition and subtraction, as usual. Thus, (3, -1, 0) + (2, 2, 4) =
(5, 1, 4). Let 0 denote the element (0, 0, ..., 0). It is easy to see that
for any $P, Q \in Z^n$, we have P < Q if and only if Q - P > 0.
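The ordering and the difference property can be checked mechanically. The following is only an illustrative sketch in Python (not part of the methods themselves); it relies on the fact that Python tuples compare lexicographically, and the helper names `sub` and `precedes_by_difference` are hypothetical.

```python
from itertools import product

def sub(q, p):
    """Coordinate-wise difference Q - P of two index points."""
    return tuple(b - a for a, b in zip(p, q))

def precedes_by_difference(p, q):
    """True when Q - P > 0, with 0 the all-zero tuple."""
    return sub(q, p) > (0,) * len(p)

# The ordering quoted in the text.
chain_holds = (2, 9, 13) < (3, -1, 10) < (3, 0, 0)

# Exhaustive check on a small cube of Z^3: P < Q exactly when Q - P > 0.
points = list(product(range(-2, 3), repeat=3))
property_holds = all(
    (p < q) == precedes_by_difference(p, q) for p in points for q in points
)
```
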
II. THE DO CONC STATEMENT

Our objective is to find a new temporal ordering of the executions of
the loop body so that at any given time, the loop body is being executed in
parallel for different elements of the index set by different processors. This
new ordering must yield an algorithm which is equivalent to the one
described by the original loop; i.e., one which computes the same values for
all variables as the original loop.

Consider the loop

(2)       DO 10 I^1 = 1, 3
          DO 10 I^2 = 2, 7
          A(I^1 + 3, I^2) = 0
     10   CONTINUE

The loop body could be executed in parallel by three processors for the
points (1, 6), (2, 5), and (3, 4) of $\mathcal{I}$. (In fact, it could be executed in
parallel by 18 processors for all points in $\mathcal{I}$.)

In order to have a means of expressing parallel computation, we
define the DO CONC (for CONCurrently) statement. Its form is

          DO α CONC FOR ALL I $\in \mathcal{S}$

where $\mathcal{S}$ is a finite subset of $Z$.* It has the following meaning: Let
$\mathcal{S} = \{i_1, \ldots, i_m\}$, where no two $i_j$ are equal, and assume that we have
m independent, completely asynchronous processors numbered 1 through m.
Then each processor is to execute the statements following the DO CONC
statement, through statement α, with processor number j setting $I = i_j$.
The m processors are to run concurrently, independent of one another.

*We remind the reader that a set is just an unordered collection of elements,
so {1, 2} = {2, 1} = {1, 2, 1, 1, 2}. We will not bother to define a syntax
for expressing sets. The usual FORTRAN DO syntax, which can only
describe a restricted class of subsets of $Z$, is probably the most
convenient to implement.

As an example, consider

          DO 10 CONC FOR ALL J $\in \{x : 2 \le x \le 5\}$
     10   A(J) = J ** 2

This sets A(2) = 4, A(3) = 9, A(4) = 16 and A(5) = 25.

For  a  DO  CONC  to  give  a  well-defined  algorithm,  certain  restrictions 
must  be  made  on  the  statements  in  its  range.  Suppose  the  statement 

9  B(J)  =  A(J  +  1) 

is inserted before statement 10 above. The resulting DO CONC loop does
not give well-defined results. For example, the processor doing the
computation for J = 3 sets B(3) to the value of A(4). But the value of A(4) it uses
depends upon whether or not the processor for J = 4 has already executed
statement 10. Since the processors are assumed to be asynchronous, the
resulting value of B(3) is not well-defined.
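The dependence on timing can be made concrete with a small simulation. This is a rough sketch only: it models just two of the many possible interleavings, by letting each simulated processor run statements 9 and 10 to completion in a chosen order, and the initial contents of A (all -1) are an assumption made for illustration.

```python
def run(order, a_init):
    """Run statements 9 and 10 for each J, one whole processor at a time."""
    A = dict(a_init)
    B = {}
    for j in order:
        B[j] = A[j + 1]   # statement 9:  B(J) = A(J + 1)
        A[j] = j ** 2     # statement 10: A(J) = J ** 2
    return A, B

# Hypothetical initial values of A(2), ..., A(6).
a_init = {j: -1 for j in range(2, 7)}

# Processor J = 3 finishes before processor J = 4 starts.
_, b_three_first = run([2, 3, 4, 5], a_init)

# Processor J = 4 finishes before processor J = 3 starts.
_, b_four_first = run([5, 4, 3, 2], a_init)
```

In the first schedule B(3) receives the old value of A(4); in the second it receives 16, so the two schedules disagree.
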

We  will  not  bother  specifying  the  necessary  restrictions  on  the 
DO  CONC  loop.  The  DO  CONCs  which  will  be  written  appear  in  loops  which 
are  equivalent  to  the  original  DO  loop  (1),  and  hence  must  give  well- 
defined  algorithms. 

The DO CONC statement is generalized to the form

          DO α CONC FOR ALL $(I^1, \ldots, I^k) \in \mathcal{S}$,

where $\mathcal{S}$ is a subset of $Z^k$. The meaning should be clear: for each
element $(p^1, \ldots, p^k) \in \mathcal{S}$ we have a processor performing the calculation
for $I^1 = p^1, \ldots, I^k = p^k$.
III. REWRITING THE LOOP

Consider loop (2), with index set $\mathcal{I}$. Changing the order of
execution of the loop body for the different elements of $\mathcal{I}$ obviously
does not change the algorithm. The loop can, therefore, be rewritten as
a single DO CONC, or in many different ways as a nested DO/DO CONC
loop. Choosing one of these ways, we rewrite it as follows:

(3)       DO 10 J^1 = 3, 10
          DO 10 CONC FOR ALL J^2 $\in \{y : 2 \le y \le 7 \text{ and } J^1 - 3 \le y \le J^1 - 1\}$
          A(J^1 - J^2 + 3, J^2) = 0
     10   CONTINUE

The  choice  is  arbitrary  and  unnatural,  but  instructive. 

To actually construct loop (3), we first defined the one-to-one
mapping $J: Z^2 \to Z^2$ by

      $J[(I^1, I^2)] = (I^1 + I^2,\ I^2) = (J^1, J^2)$,

as illustrated in Figure 1. We next defined the index set $\mathcal{J}$ to be the
set $J(\mathcal{I}) = \{J(P) : P \in \mathcal{I}\}$. Then
$\mathcal{J} = \{(J^1, J^2) : 3 \le J^1 \le 10,\ 2 \le J^2 \le 7 \text{ and } J^1 - 3 \le J^2 \le J^1 - 1\}$,
and we filled in the limits of the DO and DO
CONC statements to give this index set. Finally, we rewrote the loop
body in such a way that executing the body of loop (3) for the point
$J(P) \in \mathcal{J}$ is equivalent to executing the body of loop (2) for the point
$P \in \mathcal{I}$. In other words, $A(J^1 - J^2 + 3, J^2)$ references the same array
element as $A(I^1 + 3, I^2)$ when $(J^1, J^2) = J[(I^1, I^2)]$.
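As a sanity check on this construction, the array elements touched by loops (2) and (3) can be enumerated and compared. The sketch below is illustrative Python rather than FORTRAN; it treats each value of J^1 as one step whose points may run in any order, and the variable names are hypothetical.

```python
# Array elements assigned by loop (2), in its sequential order.
loop2 = [(i1 + 3, i2) for i1 in range(1, 4) for i2 in range(2, 8)]

# Loop (3): sequential in J1; all J2 in the set run concurrently.
loop3_steps = []
for j1 in range(3, 11):
    concurrent_j2 = [j2 for j2 in range(2, 8) if j1 - 3 <= j2 <= j1 - 1]
    loop3_steps.append([(j1 - j2 + 3, j2) for j2 in concurrent_j2])

loop3 = [element for step in loop3_steps for element in step]

same_elements = sorted(loop2) == sorted(loop3)
widest_step = max(len(step) for step in loop3_steps)
```

Here `widest_step` is the largest number of processors that loop (3) keeps busy at once, and `len(loop3_steps)` is the number of sequential steps.
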
We can consider loop (3) to be the same as loop (2), except with
a different order of execution of the body for the elements of $\mathcal{I}$. This order
of execution is illustrated in Figure 2. The loop body is executed
concurrently for all points in $\mathcal{I}$ lying on a straight line $J^1$ = constant. The
execution for those points of $\mathcal{I}$ with $J^1 = 3$ precedes the execution for the
points with $J^1 = 4$, which in turn precedes the execution for the points
with $J^1 = 5$, etc.

This suggests that we define the mapping $\pi: Z^2 \to Z$ by

      $\pi[(I^1, I^2)] = J^1 = I^1 + I^2$.

Then the execution for $P \in \mathcal{I}$ precedes the execution for $Q \in \mathcal{I}$ if and
only if $\pi(P) < \pi(Q)$. If $\pi(P) = \pi(Q)$, then the two executions of the loop
body are concurrent.

The generalization of this rewriting procedure is straightforward.
Loop (1) will be rewritten in the form

(4)       DO α J^1 = λ^1, μ^1
             .
             .
          DO α J^k = λ^k, μ^k
          DO α CONC FOR ALL $(J^{k+1}, \ldots, J^n) \in \mathcal{S}_{J^1, \ldots, J^k}$
             loop body
     α    CONTINUE

where $\mathcal{S}_{J^1, \ldots, J^k}$ is a subset of $Z^{n-k}$ which may depend upon the values
of $J^1, \ldots, J^k$. Here, $\lambda^i$ and $\mu^i$ need not be integers, but may be
integer-valued expressions whose values depend upon $J^1, \ldots, J^{i-1}$.

To perform this rewriting, we will construct a one-to-one mapping
$J: Z^n \to Z^n$ of the form

(5)   $J[(I^1, \ldots, I^n)] = \left(\sum_{j=1}^{n} a^1_j I^j,\ \ldots,\ \sum_{j=1}^{n} a^n_j I^j\right) = (J^1, \ldots, J^n)$

for integers $a^i_j$.* We then choose the $\lambda^i$, $\mu^i$ and $\mathcal{S}_{J^1, \ldots, J^k}$ so that the
index set $\mathcal{J}$ of the loop (4) equals $J(\mathcal{I})$, and write the body of loop (4) so
that its execution for the point $J(P) \in \mathcal{J}$ is equivalent to the execution of
the body of loop (1) for $P \in \mathcal{I}$.

Define the mapping $\pi: Z^n \to Z^k$ by

      $\pi[(I^1, \ldots, I^n)] = (J^1, \ldots, J^k)$,

so $\pi(P)$ consists of the first k coordinates of J(P). It is then clear that
for any points $J(P), J(Q) \in \mathcal{J}$, the execution of the body of loop (4) for
J(P) precedes the execution for J(Q) if and only if $\pi(P) < \pi(Q)$. If
we consider loop (4) to be a reordering of the execution of loop (1), this
statement is equivalent to the following:

(E)   For any $P, Q \in \mathcal{I}$, the execution of the loop
      body for P precedes that for Q, in the new
      ordering of executions, if and only if
      $\pi(P) < \pi(Q)$.

*J is one-to-one if and only if (5) can be solved to write the $I^j$ as linear
expressions in the $J^i$ with integer coefficients.


The loop body is executed concurrently for all elements of $\mathcal{I}$
lying on a set of the form $\{P : \pi(P) = \text{constant} \in Z^k\}$. Since J is
assumed to be a one-to-one linear mapping, these sets are parallel
(n-k)-dimensional planes in $Z^n$.* We thus have concurrent execution
of the loop body along (n-k)-dimensional planes through the index set.

Naturally, we cannot use any arbitrary mapping J. We must find
one for which loop (4) gives an algorithm equivalent to that of loop (1).
This is the goal of the following analysis.

Observe that rewriting loop (1) so all executions are concurrent;
i.e., with a

          DO α CONC FOR ALL $(I^1, \ldots, I^n) \in \mathcal{I}$

statement, involves setting J equal to the identity mapping, k = 0,
and $\pi: Z^n \to Z^0$ the mapping defined by $\pi(P) = 0$ for all $P \in Z^n$.

*We consider $Z^n$ to be a subset of ordinary Euclidean n-space, as
we did in drawing Figures 2 and 3.
IV. THE BASIC RULE


We first introduce some terminology to aid the discussion.
Consider the variable VAR defined by the statement

          DIMENSION VAR (10, 20) .

The range of VAR is the set $\mathcal{R}_{VAR} = \{(x, y) : 1 \le x \le 10,\ 1 \le y \le 20\}$,
which is a subset of $Z^2$. Thus, $\mathcal{R}_{VAR}$ is the set of all $(x, y) \in Z^2$
such that VAR (x, y) is defined.*

An occurrence of VAR is any appearance of it in the loop body.
If the occurrence appears as

          VAR (-,-) = ...,

then it is called a generation; otherwise it is called a use. I.e.,
generations modify the values of elements of the array VAR, and uses
do not.

Let f denote an occurrence of VAR in loop (1) of the form
VAR ($I^1 + 3$, $I^3$), and assume n = 3. During execution of the loop body
for the element $(3, 4, 5) \in \mathcal{I}$, this occurrence becomes VAR (6, 5).
We say that f references the point $(6, 5) \in \mathcal{R}_{VAR}$ for (3, 4, 5).
This defines an occurrence mapping $T_f: \mathcal{I} \to \mathcal{R}_{VAR}$ by
letting $T_f(P)$ be the point of $\mathcal{R}_{VAR}$ referenced by f for $P \in \mathcal{I}$.
In this case, $T_f$ is given by

      $T_f[(p^1, p^2, p^3)] = (p^1 + 3,\ p^3)$.

*For a scalar variable x, we set $\mathcal{R}_x = Z^0$.

We will assume that all variable occurrences only have the
loop variables $I^1, \ldots, I^n$ and integer constants in their subscript
expressions. Then for any variable occurrence g, the occurrence
mapping $T_g: Z^n \to Z^m$ is well-defined, where m is the dimension
(number of subscript positions) of the variable.
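An occurrence mapping is easy to model directly. The following Python sketch is illustrative only; it assumes each subscript has the form "one loop variable plus an integer constant", and the encoding of a subscript as a pair `(j, m)` standing for $I^j + m$ is a hypothetical convention chosen here.

```python
def make_occurrence_mapping(subscripts):
    """Build T for an occurrence whose subscripts are I^j + m.

    subscripts: list of (j, m) pairs, with j the 1-based index
    variable number and m the integer constant.
    """
    def T(p):
        return tuple(p[j - 1] + m for j, m in subscripts)
    return T

# The occurrence VAR(I^1 + 3, I^3) from the text, with n = 3.
T_f = make_occurrence_mapping([(1, 3), (3, 0)])
referenced = T_f((3, 4, 5))
```
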

Now consider the loop

(6)       DO 23 I^1 = 2, 10
          DO 23 I^2 = 3, 17
     21   A(I^1, I^2) = C(I^1)
           (a1)
     22   B(I^1, I^2) = A(I^1 - 1, I^2 + 1) + B(I^1, I^2)
           (b1)           (a2)                 (b2)
     23   CONTINUE

We have introduced the convention of writing the name of an occurrence
in a circle beneath it. For the point $(4, 7) \in \mathcal{I}$, the loop body is

     21   A(4, 7) = C(4)
     22   B(4, 7) = A(3, 8) + B(4, 7) .

The value A(3, 8) used in statement 22 is the one computed in statement
21 during execution of the loop body for the point (3, 8). To ensure that
the execution for (4, 7) computes the right value when we change the order
of executions of the body, we need only require that it be preceded by the
execution for (3, 8). By statement E above, this means that $\pi$ must
satisfy $\pi[(3, 8)] < \pi[(4, 7)]$.

In general, let VAR be any variable. If a generation and a use of
VAR both reference the same element in the range of VAR during execution
of the loop, then the order of the references must be preserved. In other
words, if f is a generation and g is a use of VAR, and $T_f(P) = T_g(Q)$
for some points $P, Q \in \mathcal{I}$, then:

(i)   If P < Q, we must have $\pi(P) < \pi(Q)$,

(ii)  If Q < P, we must have $\pi(Q) < \pi(P)$.

In the above example, $T_{a1}[(3, 8)] = T_{a2}[(4, 7)] = (3, 8)$ and
(3, 8) < (4, 7), so we must have $\pi[(3, 8)] < \pi[(4, 7)]$. Note that if
P = Q, then the order of execution of the references will automatically be
preserved, since they happen during a single execution of the loop body.
Thus, the fact that $T_{b1}[(4, 7)] = T_{b2}[(4, 7)]$ does not place any
restriction on our choice of $\pi$.

The above rule should also apply to any two generations of a
variable. This guarantees that the variable has the correct values after
the loop is run. Together with the above rule, it also ensures that a use
will always obtain the value assigned by the correct generation.

These remarks can be combined into the following basic rule:

(C1)  For every variable, and every ordered pair of occurrences
      f, g of that variable, at least one of which is a generation: if
      $T_f(P) = T_g(Q)$ for $P, Q \in \mathcal{I}$ with P < Q, then $\pi$ must satisfy
      the relation $\pi(P) < \pi(Q)$.

Notice that the case Q < P is obtained by interchanging f and g.

Rule C1 ensures that the new ordering of executions of the loop
body preserves all relevant orderings of variable references. The
orderings not necessarily preserved are those between references to
different array elements, and between two uses. Changing just these
orderings cannot change the value of anything computed by the loop.
The assumptions we have made about the loop body, especially the
assumption that it contains no exits from the loop, therefore imply that
rule C1 gives a sufficient condition for loop (4) to be equivalent to loop
(1).*

*For most loops, C1 is also a necessary condition.
V. THE SETS $\langle f, g \rangle$


The trouble with rule C1 is that it requires that we consider many
pairs of points P, Q in $\mathcal{I}$. For the loop (6), there are 112 pairs of
elements $P, Q \in \mathcal{I}$ with $T_{a1}(P) = T_{a2}(Q)$ and P < Q. However,
$T_{a1}(P) = T_{a2}(Q)$ if and only if Q = P + (1, -1). We would like to be able
to work with the single point $(1, -1) \in Z^2$, rather than all the pairs P, Q.
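The count of 112 pairs, and the fact that they all share one difference, can be confirmed by brute force. This Python sketch is an illustration only; the function names `T_a1` and `T_a2` simply mirror the occurrence names used for loop (6).

```python
from itertools import product

# Index set of loop (6): I^1 = 2..10, I^2 = 3..17.
index_set = list(product(range(2, 11), range(3, 18)))

def T_a1(p):
    """Generation A(I^1, I^2) in statement 21."""
    return (p[0], p[1])

def T_a2(p):
    """Use A(I^1 - 1, I^2 + 1) in statement 22."""
    return (p[0] - 1, p[1] + 1)

pairs = [
    (p, q)
    for p in index_set
    for q in index_set
    if p < q and T_a1(p) == T_a2(q)
]

# Every pair has the same difference Q - P.
differences = {(q[0] - p[0], q[1] - p[1]) for p, q in pairs}
```
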

This suggests the following definition. For any occurrences f,
g of a variable in loop (1), define the subset $\langle f, g \rangle$ of $Z^n$ by

      $\langle f, g \rangle = \{X : T_f(P) = T_g(P + X) \text{ for some } P \in Z^n\}$.

Thus, for loop (6) we have $\langle a1, a2 \rangle = \{(1, -1)\}$, and
$\langle b1, b2 \rangle = \{(0, 0)\}$. Observe that $\langle f, g \rangle$ is independent of the index
set $\mathcal{I}$.

We now rewrite rule C1 in terms of the sets $\langle f, g \rangle$. First,
note that $\pi(P + X) = \pi(P) + \pi(X)$, since we have assumed $\pi$ to be a
linear mapping. (Recall the definition of $\pi$, and formula (5).) Also,
remember that A < A + B if and only if B > 0. Then just substituting
P + X for Q in rule C1 yields:

(C1') For ... generation: if $T_f(P) = T_g(P + X)$ for
      $P, P + X \in \mathcal{I}$ with X > 0, then $\pi$ must satisfy
      the relation $\pi(X) > 0$.

Removing the clause "for $P, P + X \in \mathcal{I}$" from C1' gives a stronger
condition for $\pi$ to satisfy. Doing this and using the definition of
$\langle f, g \rangle$ then gives the following more stringent rule:

(C2)  For every variable, and every ordered pair of
      occurrences f, g of that variable, at least
      one of which is a generation: for every
      $X \in \langle f, g \rangle$ with X > 0, $\pi$ must satisfy $\pi(X) > 0$.

Any $\pi$ satisfying C2 also satisfies C1. Hence, rule C2 gives
a sufficient condition for loop (4) to be equivalent to loop (1). Moreover,
C2 is independent of the index set $\mathcal{I}$.

Note here that C2 is satisfied by the zero mapping $\pi: Z^n \to Z^0$
if and only if it is vacuous; i.e., if and only if there are no elements
X > 0 in any of the sets $\langle f, g \rangle$ referred to in the rule. In this case,
the loop body can be executed concurrently for all points in $\mathcal{I}$.


VI. COMPUTING THE SETS $\langle f, g \rangle$


We will obtain results about the existence of mappings $\pi$
satisfying rule C2. In order to do this, some restrictions must be
made on the forms of the occurrences to permit a simple computation
of the sets $\langle f, g \rangle$. We assume that each occurrence of a variable
VAR is of the form

(7)   VAR ($I^{j_1} + m^1, \ldots, I^{j_r} + m^r$),

where the $m^k$ are integer constants, and $j_1, \ldots, j_r$ are r distinct
integers between 1 and n. Moreover, we assume that the $j_k$ are the
same for any two occurrences of VAR. Thus, if an occurrence
A ($I^2 - 1$, $I^1$, $I^4 + 1$) appears in the loop, then the occurrence
A ($I^2 + 1$, $I^1 + 6$, $I^4$) may appear. However, the occurrence
A ($I^1 - 1$, $I^2$, $I^4$) may not.

Now let f be the occurrence (7) and let g be the occurrence
VAR ($I^{j_1} + n^1, \ldots, I^{j_r} + n^r$). Then

      $T_f[(p^1, \ldots, p^n)] = (p^{j_1} + m^1, \ldots, p^{j_r} + m^r)$
      $T_g[(p^1, \ldots, p^n)] = (p^{j_1} + n^1, \ldots, p^{j_r} + n^r)$

It is easy to see from the definition that $\langle f, g \rangle$ is the set of all
elements of $Z^n$ whose $j_k$-th coordinate is $m^k - n^k$, for k = 1, ..., r,
and whose remaining n - r coordinates are any integers.
As an example, suppose n = 5 and f, g are the occurrences
VAR ($I^3 + 1$, $I^2 + 5$, $I^5$), VAR ($I^3 + 1$, $I^2$, $I^5 + 1$). Then $\langle f, g \rangle$ =
$\{(x, 5, 0, y, -1) : x, y \in Z\}$. We will denote this set by (*, 5, 0, *, -1),
so * means "any integer".

The index variable $I^j$ is said to be missing from VAR if $I^j$ is
not one of the $I^{j_k}$ in (7). It is clear that $I^j$ is missing from VAR if and
only if the set $\langle f, g \rangle$ has an * in the j-th coordinate, for any occurrences
f, g of VAR.
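The computation of a set $\langle f, g \rangle$ for occurrences of the form (7) amounts to a coordinate-wise subtraction. The Python sketch below is one possible encoding, not a prescribed one: an occurrence is represented as a hypothetical dictionary from the index variable number $j_k$ to its constant, and the string `'*'` stands for "any integer".

```python
def dependence_set(n, f, g):
    """Return <f, g> as an n-tuple of integers and '*' wildcards.

    f, g: dicts mapping index variable number j (1-based) to the
    constant added to I^j; both occurrences must use the same j's.
    """
    if f.keys() != g.keys():
        raise ValueError("occurrences must use the same index variables")
    return tuple(f[j] - g[j] if j in f else '*' for j in range(1, n + 1))

# f = VAR(I^3 + 1, I^2 + 5, I^5) and g = VAR(I^3 + 1, I^2, I^5 + 1), n = 5.
f = {3: 1, 2: 5, 5: 0}
g = {3: 1, 2: 0, 5: 1}
fg = dependence_set(5, f, g)

# Index variables missing from VAR are exactly the wildcard positions.
missing = [j for j, c in enumerate(fg, start=1) if c == '*']
```
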
VII. THE HYPERPLANE THEOREM


$I^j$ is called a missing index variable if it is missing from
some generated variable in the loop; i.e., if it is missing from some
variable for which a generation appears in the loop body.

The following result is an important special case of a more
general result which will be given later.*

Hyperplane Concurrency Theorem: Assume that none of the index
variables $I^2, \ldots, I^n$ is a missing variable. Then loop (1) can be
rewritten in the form of loop (4) for k = 1. Moreover, the mapping
J used for the rewriting can be chosen to be independent of the index
set $\mathcal{I}$.

Proof: First, a mapping $\pi: Z^n \to Z$ will be constructed which satisfies
rule C2. Let $\mathcal{P}$ be the set consisting of all the elements X > 0 of all
the sets $\langle f, g \rangle$ referred to in C2. We must construct $\pi$ so that
$\pi(X) > 0$ for all $X \in \mathcal{P}$.

Let "+" denote any positive integer, so $(+, x^2, \ldots, x^n)$ is any
element of $Z^n$ of the form $(x, x^2, \ldots, x^n)$ with x > 0. Since $I^1$
is the only index variable which may be missing, we can write
$\mathcal{P} = \{X_1, \ldots, X_N\}$, where

      $X_r = (x_r^1, \ldots, x_r^n)$   or   $X_r = (+, x_r^2, \ldots, x_r^n)$

for some integers $x_r^j$.

*A weaker version of this result can be found in [2].

The mapping $\pi$ is defined by

(8)   $\pi[(I^1, \ldots, I^n)] = a_1 I^1 + \cdots + a_n I^n$

for non-negative integers $a_j$ to be chosen below. Since $a_1 \ge 0$,
$\pi[(1, x_r^2, \ldots, x_r^n)] > 0$ implies $\pi[(x, x_r^2, \ldots, x_r^n)] > 0$ for
any integer x > 0. Therefore, each $X_r$ of the form $(+, x_r^2, \ldots, x_r^n)$
can be replaced by $X_r = (1, x_r^2, \ldots, x_r^n)$ and it is sufficient to
construct $\pi$ such that $\pi(X_r) > 0$ for each r = 1, ..., N. Recall
that each $X_r > 0$.

Define $\mathcal{P}_j = \{X_r : x_r^1 = \cdots = x_r^{j-1} = 0,\ x_r^j \ne 0\}$, so $\mathcal{P}_j$
is the set of all $X_r$ whose j-th coordinate is the left-most non-zero
one. Then each $X_r$ is an element of some $\mathcal{P}_j$.

Now construct the $a_j$ sequentially for j = n, n-1, ..., 1
as follows. Let $a_j$ be the smallest non-negative integer such that

      $a_j x_r^j + \cdots + a_n x_r^n > 0$

for each $X_r = (0, \ldots, 0, x_r^j, \ldots, x_r^n) \in \mathcal{P}_j$. Since $X_r > 0$ and
$x_r^j \ne 0$ implies $x_r^j > 0$, this is possible.

Clearly, we have $\pi(X_r) > 0$ for all $X_r \in \mathcal{P}_j$. But each
$X_r$ is in some $\mathcal{P}_j$, so $\pi(X_r) > 0$ for each r = 1, ..., N. Thus, $\pi$
satisfies rule C2. Observe that the first non-zero $a_j$ that was chosen
must equal 1, so 1 is the greatest common divisor of the $a_j$. (If all
the $a_j$ are zero, then $\mathcal{P}$ must be empty, so we can let $\pi[(I^1, \ldots, I^n)]
= I^1$.) A classical number theoretic calculation, described on page 31
of [3], and reproduced in Appendix A, then gives a one-to-one linear
mapping $J: Z^n \to Z^n$ such that

      $J[(I^1, \ldots, I^n)] = (\pi[(I^1, \ldots, I^n)], \ldots)$.

Since the sets $\langle f, g \rangle$ are independent of the index set $\mathcal{I}$,
the construction of $\pi$ and J given above is also independent of $\mathcal{I}$.
This completes the proof.
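The construction of the $a_j$ in the proof translates almost line for line into code. The Python sketch below is an illustration under stated assumptions: each vector is given with any leading "+" already replaced by 1, every vector is lexicographically positive, and the function name `construct_pi` and the second (three-index) vector set are hypothetical examples, not taken from a loop in the text.

```python
def construct_pi(n, vectors):
    """Choose a_n, ..., a_1 as in the proof; returns [a_1, ..., a_n]."""
    zero = (0,) * n
    if any(tuple(x) <= zero for x in vectors):
        raise ValueError("every vector must be lexicographically positive")
    a = [0] * n
    for j in range(n - 1, -1, -1):  # j = n, n-1, ..., 1 (0-based here)
        # P_j: vectors whose left-most non-zero coordinate is coordinate j.
        p_j = [x for x in vectors
               if all(c == 0 for c in x[:j]) and x[j] != 0]

        def satisfied(aj):
            return all(
                aj * x[j] + sum(a[i] * x[i] for i in range(j + 1, n)) > 0
                for x in p_j
            )

        aj = 0
        while not satisfied(aj):  # terminates because x[j] > 0 in P_j
            aj += 1
        a[j] = aj
    if all(c == 0 for c in a):    # no constraints at all: take pi = I^1
        a[0] = 1
    return a

# Loop (6): the only positive element is (1, -1).
pi_loop6 = construct_pi(2, [(1, -1)])

# A hypothetical three-index constraint set.
pi_three = construct_pi(3, [(1, -1, 0), (0, 1, 0), (0, 0, 1), (1, 0, -1)])
```

For loop (6) this yields $\pi[(I^1, I^2)] = I^1$, so the inner loop over $I^2$ can run concurrently.
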

Loop (4) for k = 1 executes the loop body concurrently for all
points in $\mathcal{I}$ lying along parallel (n-1)-dimensional hyperplanes,
hence the name of the theorem.

Observe that the theorem is trivially true without the restriction
that J be independent of $\mathcal{I}$, because given any set $\mathcal{I}$ we can construct
a J for which the sets $\mathcal{S}_{J^1}$ contain at most one element,
and the order of execution of the loop body is unchanged. For example,
if $\mathcal{I} = \{(x, y, z) : 1 \le x \le 10,\ 1 \le y \le 5,\ 1 \le z \le 7\}$, let J[(x, y, z)] =
(35x + 7y + z, x, y). Such a J is clearly of no interest. However,
because the mapping J provided by the theorem depends only on
the loop body, it will always give real concurrent execution for a
large enough index set.

Condition C2 gives a set of constraints on the mapping
$\pi: Z^n \to Z$. The Hyperplane Theorem proves the existence of a $\pi$
satisfying those constraints. We now consider the problem of making
an optimal choice of $\pi$.

It seems most reasonable to minimize the number of steps in
the outer DO $J^1$ loop of loop (4). (Remember that k = 1.) If a
sufficiently large number of processors are available, then this gives
the maximum amount of concurrent computation. This means that we
must minimize $\mu^1 - \lambda^1$ in loop (4). But $\lambda^1$ and $\mu^1$ are just the
lower and upper bounds of $\{\pi(P) : P \in \mathcal{I}\}$. Setting

      $M^i = u^i - l^i$,

it is easy to see that $\mu^1 - \lambda^1$ equals

(9)   $M^1 |a_1| + \cdots + M^n |a_n|$,

where the $a_i$ are defined by (8). Finding an optimal $\pi$ is thus
reduced to the following integer programming problem: find integers
$a_1, \ldots, a_n$ satisfying the constraint inequalities given by rule C2,
which minimize the expression (9).

Observe that the greatest common divisor of the resulting $a_i$
must be 1. This follows because the constraints are of the form

      $x^1 a_1 + \cdots + x^n a_n > 0$,

so dividing the $a_i$ by their g.c.d. gives new values of $a_i$ satisfying
the constraints, with a smaller value for (9). Hence, the method of
[3] can be applied to find the mapping J.

Although the above integer programming problem is algorithmically
solvable, we know of no practical method of finding a solution in the
general case. However, the construction used in proving the Hyperplane
Theorem should provide a good choice of $\pi$. In fact, for most
reasonable loops it actually gives the optimal solution.
VIII. AN EXAMPLE


Now consider the following loop:

          DO 16 I^1 = 1, 25
          DO 16 I^2 = 2, 19
          DO 16 I^3 = 2, 29
          F(I^2, I^3) = (F(I^2 + 1, I^3) + F(I^2, I^3 + 1)
         X               + F(I^2 - 1, I^3) + F(I^2, I^3 - 1)) * .25
     16   CONTINUE

It  is  a  simplified  version  of  a  standard  relaxation  computation  for  a 
20  by  30  array  F,  performed  25  times. 

To apply the method of analysis, first perform the following
calculations:

1.  Compute the sets $\langle f, g \rangle$ referred to by rule C2.

2.  Find all elements X > 0 in these sets.

3.  Find the constraints on the $a_j$ implied by $\pi(X) > 0$.

This is done in Table 1.

[Table 1]

Next, choose $a_1$, $a_2$, $a_3$ consistent with these constraints,
and minimizing

      $24\,|a_1| + 17\,|a_2| + 27\,|a_3|$.

It is easy to see that the solution to this problem is $a_1 = 2$, $a_2 = 1$,
$a_3 = 1$, so $\pi$ is given by

      $\pi[(I^1, I^2, I^3)] = 2 I^1 + I^2 + I^3$.

Note that this is the $\pi$ computed by the algorithm used in the proof of
the Hyperplane Theorem.
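The stated optimum can be cross-checked by a bounded exhaustive search. The Python sketch below is illustrative only. The constraint vectors are derived here from the five occurrences of F by the rule of Section VI, with each leading "+" replaced by 1 (which is sufficient because the vector (1, 0, 0) forces $a_1 > 0$); they are an assumption of this sketch, not a transcription of Table 1, and the search range of -4 to 4 is likewise an arbitrary bound.

```python
from itertools import product

# M^i = u^i - l^i for the loop bounds 1..25, 2..19, 2..29.
M = (25 - 1, 19 - 2, 29 - 2)

# Positive elements X of the sets <f, g> for the relaxation loop
# (derived for this sketch; '+' in the first coordinate written as 1).
vectors = [
    (1, 0, 0),
    (0, 1, 0), (1, 1, 0), (1, -1, 0),
    (0, 0, 1), (1, 0, 1), (1, 0, -1),
]

def satisfies_c2(a):
    """pi(X) > 0 for every constraint vector X."""
    return all(sum(ai * xi for ai, xi in zip(a, x)) > 0 for x in vectors)

def cost(a):
    """Expression (9): M^1 |a_1| + ... + M^n |a_n|."""
    return sum(m * abs(ai) for m, ai in zip(M, a))

candidates = [a for a in product(range(-4, 5), repeat=3) if satisfies_c2(a)]
best = min(candidates, key=cost)
best_cost = cost(best)
```

The minimum cost of 92 agrees with the limits 6 and 98 of the outer DO $J^1$ loop in the rewritten form, since 98 - 6 = 92.
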

Application of the algorithm described in the appendix gives the
following mapping J:*

$$J[(I^1, I^2, I^3)] = (J^1, J^2, J^3) = (2I^1 + I^2 + I^3,\; I^1,\; I^3).$$

Using this, and the inverse relation

$$(I^1, I^2, I^3) = (J^2,\; J^1 - 2J^2 - J^3,\; J^3),$$

the above loop is rewritten as follows:

      DO 16 J1 = 6, 98
      DO 16 CONC FOR ALL (J2, J3) ∈ {(x, y):
     X   1 ≤ x ≤ 25, 2 ≤ y ≤ 29 and J1 - 19 ≤ 2x + y ≤ J1 - 2}
      F(J1 - 2 * J2 - J3, J3) = (F(J1 - 2 * J2 - J3 + 1, J3) +
     X   F(J1 - 2 * J2 - J3, J3 + 1) + F(J1 - 2 * J2 - J3 - 1, J3)
     X   + F(J1 - 2 * J2 - J3, J3 - 1)) * .25
   16 CONTINUE

*It is also easy to obtain this from the following fact: the mapping
$J: \mathbf{Z}^n \to \mathbf{Z}^n$ defined by (5) is one-to-one if and only if the
determinant of the $a^i_j$ is $\pm 1$.

The set expression in the DO CONC statement was obtained by
writing the relations $1 \le I^1 \le 25$, $2 \le I^2 \le 19$ and $2 \le I^3 \le 29$ in
terms of $J^1$, $J^2$, $J^3$.

To understand why the rewritten loop gives the same results,
consider the computation of F(4, 6) in the execution of the original
loop body for the element (9, 4, 6) of the index set $\mathcal{I}$. It is set equal
to the average of its four neighboring array elements: F(5, 6), F(4, 7),
F(3, 6), F(4, 5). The values of F(5, 6) and F(4, 7) were calculated during
the execution of the loop body for (8, 5, 6) and (8, 4, 7), respectively;
i.e., during the previous execution of the DO I1 loop, with I1 = 8.
The values of F(3, 6) and F(4, 5) were calculated during the current
execution of the outer DO loop, with I1 = 9. This is shown in Figure 3.

Now consider the rewritten loop. At any time during its
execution, F(p, q) is being computed concurrently for up to half the
elements (p, q) in the range set of F. These computations are
for different values of I1. Figure 4 illustrates the execution of the
DO CONC for J1 = 27. The points (p, q) of the range set for which F(p, q)
is being computed are marked with "x"s, and the value of I1 for the
computation is indicated. Figure 5 shows the same thing for J1 = 28.
Note how the values being used in the computation of F(4, 6)
in Figure 5 were computed in Figure 4. A comparison with Figure 3
illustrates why this method of concurrent execution is equivalent to
the original algorithm.

The saving in execution time achieved by the rewriting will
depend upon the amount of overhead in the implementation of the DO
CONC, as well as the actual number of processors available. (The
sets in the DO CONC statement contain up to 252 elements.) However,
the value of this approach is indicated by the fact that the number of
sequential iterations has been reduced by a factor of over 135, from
25 x 18 x 28 = 12600 to 93. (We must point out, though, that a real
program would probably include a convergence test within the outer
DO I1 loop, so the analysis could only be applied to the inner
DO I2 / DO I3 loop.)
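
The equivalence of the two loops can also be checked by direct simulation. The following Python sketch is an illustration only and is not part of the original analysis: the function names, the random initial data, and the use of a shuffled execution order to stand in for asynchronous processors are all assumptions made for the sake of the example.

```python
import random

def initial_array():
    # 20 by 30 array F, indexed from 1 (row 0 and column 0 are unused padding).
    rng = random.Random(0)
    return [[rng.random() for _ in range(31)] for _ in range(21)]

def body(F, i2, i3):
    # The loop body: replace F(I2, I3) by the average of its four neighbors.
    F[i2][i3] = (F[i2 + 1][i3] + F[i2][i3 + 1]
                 + F[i2 - 1][i3] + F[i2][i3 - 1]) * 0.25

def run_original():
    F = initial_array()
    for i1 in range(1, 26):
        for i2 in range(2, 20):
            for i3 in range(2, 30):
                body(F, i2, i3)
    return F

def run_rewritten(seed=1):
    F = initial_array()
    rng = random.Random(seed)
    steps, widest = 0, 0
    for j1 in range(6, 99):
        # The DO CONC set: all (J2, J3) on the hyperplane 2*I1 + I2 + I3 = J1.
        points = [(j2, j3)
                  for j2 in range(1, 26) for j3 in range(2, 30)
                  if j1 - 19 <= 2 * j2 + j3 <= j1 - 2]
        rng.shuffle(points)  # asynchronous processors: no order is assumed
        for j2, j3 in points:
            body(F, j1 - 2 * j2 - j3, j3)
        steps += 1
        widest = max(widest, len(points))
    return F, steps, widest

F_new, steps, widest = run_rewritten()
print(run_original() == F_new, steps, widest)
```

The comparison is exact because each point evaluates the same expression on the same operands in both versions. The counts of sequential steps and of the largest concurrent set agree with the figures 93 and 252 quoted above.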
IX. THE GENERAL PLANE THEOREM


We now generalize the Hyperplane Theorem to cover the case
when some of the index variables $I^2, \ldots, I^n$ are missing. Concurrent
execution is then possible for the points in $\mathcal{I}$ lying along parallel
planes. Each missing variable may lower the dimension of the planes
by one.

Plane Concurrency Theorem: Assume that at most $k - 1$ of the index
variables $I^2, \ldots, I^n$ are missing. Then loop (1) can be rewritten in
the form of loop (4). Moreover, the mapping J used for the rewriting
can be chosen to be independent of the index set $\mathcal{I}$.

Proof: The proof is a generalization of the proof of the Hyperplane
Theorem. Let $I^{j_2}, \ldots, I^{j_k}$ be the possibly missing variables among
$I^2, \ldots, I^n$. Set $j_1 = 1$, $j_{k+1} = n + 1$, and assume
$j_1 < j_2 < \cdots < j_k < j_{k+1}$.

Let $\mathcal{P}$ be the set of all elements X > 0 of all the sets <f,g>
referred to by rule C2. We must construct $\pi$ so that $\pi(X) > 0$ for all
$X \in \mathcal{P}$. Let

$$\mathcal{P}_j = \{(0, \ldots, 0, x^j, \ldots, x^n) \in \mathcal{P} : x^j > 0\},$$

so $\mathcal{P}_j$ is the set of all elements of $\mathcal{P}$ whose j-th coordinate is the
left-most non-zero one. Then every element of $\mathcal{P}$ is in one of the $\mathcal{P}_j$.

The mapping $\pi: \mathbf{Z}^n \to \mathbf{Z}^k$ will be constructed with

$$\pi(P) = (\pi^1(P), \ldots, \pi^k(P)),$$

where each $\pi^i: \mathbf{Z}^n \to \mathbf{Z}$ will be defined by

$$\pi^i[(I^1, \ldots, I^n)] = a^i_1 I^1 + \cdots + a^i_n I^n$$

for non-negative integers $a^i_j$. Moreover, we will have $a^i_j = 0$ if
$j < j_i$ or $j \ge j_{i+1}$. This implies that if $X \in \mathcal{P}_j$ and $j \ge j_{i+1}$,
then $\pi^i(X) = 0$. It therefore suffices to construct $\pi^i$ so that for each j
with $j_i \le j < j_{i+1}$, and $X \in \mathcal{P}_j$: $\pi^i(X) > 0$ - for we then have

$$\pi(X) = (0, \ldots, 0, \pi^i(X), \ldots, \pi^k(X)) > 0.$$

Recall that for the sets <f,g>, an * can appear only in the
$j_2, \ldots, j_k$ coordinates. Thus any element of any of the sets $\mathcal{P}_j$ with
$j_i \le j < j_{i+1}$ can be represented in the form

$$(0, \ldots, 0, x_r^{j_i}, \ldots, x_r^{j_{i+1}-1}, \ldots)$$

or

$$(0, \ldots, 0, +, x_r^{j_i+1}, \ldots, x_r^{j_{i+1}-1}, \ldots)$$

for a finite collection of integers $x_r^j$, $j_i \le j < j_{i+1}$. By the same argument
used in the proof of the Hyperplane Theorem, we can replace "+" by $x_r^{j_i} = 1$,
and choose $a^i_j \ge 0$, $j_i \le j < j_{i+1}$, such that

$$a^i_{j_i} x_r^{j_i} + \cdots + a^i_{j_{i+1}-1} x_r^{j_{i+1}-1} > 0$$

for each r. Choosing $a^i_j = 0$ for $j < j_i$ and $j \ge j_{i+1}$ completes the
construction of the required $\pi^i$.

The construction given in Appendix A is then applied to give
invertible relations of the form

$$J^{j_i} = a^i_{j_i} I^{j_i} + \cdots + a^i_{j_{i+1}-1} I^{j_{i+1}-1},$$

$$J^{j} = \sum_{r = j_i}^{j_{i+1}-1} b^j_r I^r \quad \text{for } j_i < j < j_{i+1}.$$

Combining these and reordering the $J^j$ gives the required mapping J. ∎

As in the hyperplane case, to get an optimal solution we want
to minimize the number of iterations of the outer DO loops. This means
minimizing $(\mu^1 - \lambda^1 + 1) \cdots (\mu^k - \lambda^k + 1)$. Using the notation of (5),
it is easy to verify that this number is equal to

(11)  $(M^1 |a^1_1| + \cdots + M^n |a^1_n| + 1) \cdots (M^1 |a^k_1| + \cdots + M^n |a^k_n| + 1)$,

where $M^j = u^j - \ell^j$.

Finding the $a^i_j$ is now an integer programming problem. Note
that a solution with $a^i_1 = \cdots = a^i_n = 0$ for some i gives a solution to the
rewriting problem with k replaced by k - 1, since that $\pi^i$ can be removed
without affecting the constraint inequalities. The Plane Concurrency Theorem
proves the existence of a $\pi: \mathbf{Z}^n \to \mathbf{Z}^k$ satisfying C2, for a particular value
of k. It may be possible to find such a $\pi$ for a smaller k.

For completeness, we will state a sufficient condition for the
loop body to be concurrently executable for all points in $\mathcal{I}$. This is the case
when the zero mapping ($\pi(P) = 0$ for all $P \in \mathbf{Z}^n$) satisfies C2. Since
<g,f> = {-X : X ∈ <f,g>}, it is clear that this is true if and only if all the sets
<f,g> are equal to {0}. Finally, the rules for computing the sets <f,g>
give the following rather obvious result:

If none of the index variables are missing, and for any generated
variable, all occurrences of that variable are identical, then loop (1) can be
rewritten as a

      DO α CONC FOR ALL (I1, ..., In) ∈ $\mathcal{I}$

loop.

The hypothesis means that in the expression (7) for any
generated variable VAR, r = n and the $m^i$ are the same for all occurrences of
VAR.


CHAPTER TWO


SYNCHRONOUS PROCESSORS


I. The DO SIM Statement


We now consider the case of completely synchronous processors,
the primary example being the ILLIAC-IV. To accommodate it, let us introduce
the DO SIM (for SIMultaneously) statement, having the following form:

      DO α SIM FOR ALL (I1, ..., Ik) ∈ $\mathcal{S}$

where $\mathcal{S}$ is a subset of $\mathbf{Z}^k$. Its meaning is similar to that of the
DO CONC statement, except that the computation is performed synchronously
by the individual processors. Each point of $\mathcal{S}$ is assigned to a separate
processor, and each statement in the range of the DO SIM is, in turn,
simultaneously executed by all the processors. An assignment statement
is executed by first computing the right-hand side, then simultaneously
performing the assignment.

As an example, consider

      DO 15 SIM FOR ALL I ∈ {x : 2 ≤ x ≤ 10}
   14 A(I) = A(I-1) + B(I)
   15 B(I) = A(I) ** 2

The right-hand side of statement 14 is executed for all I before the assignment
of A(I) is made, and before statement 15 is executed. Therefore, if initially
A(4) = 5 and B(5) = 2, then executing the loop sets A(5) = 7 and B(5) = 49.
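
The semantics just described can be mimicked in a few lines of Python. This is only a sketch under assumed conventions: the helper `do_sim`, the representation of arrays as dictionaries, and the zero initial values for the unspecified elements are hypothetical and serve only to reproduce the numbers of the example.

```python
def do_sim(index_set, statements, env):
    # Each statement is a pair (target array name, right-hand-side function).
    for target, rhs in statements:
        # First every processor computes its right-hand side ...
        values = {i: rhs(env, i) for i in index_set}
        # ... then all the assignments are performed simultaneously.
        for i, v in values.items():
            env[target][i] = v

A = {i: 0 for i in range(1, 11)}
B = {i: 0 for i in range(1, 11)}
A[4], B[5] = 5, 2
env = {"A": A, "B": B}

do_sim(range(2, 11),
       [("A", lambda e, i: e["A"][i - 1] + e["B"][i]),   # statement 14
        ("B", lambda e, i: e["A"][i] ** 2)],             # statement 15
       env)
print(A[5], B[5])   # 7 49
```

Because all right-hand sides of statement 14 are evaluated before any A(I) is stored, A(5) is computed from the initial value of A(4), which is what distinguishes the DO SIM from an ordinary sequential DO loop.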

Because of the simultaneity of execution of the body for the
various points of $\mathcal{S}$, we cannot allow any conditional transfer of control in
the loop body which depends upon the index variables. E.g., the statement

      IF (A(I)) 3, 4, 5

may not appear in a "DO SIM FOR ALL I" loop.

For simplicity, assume that there is no transfer of control within
the body of the loop, so the statements are always executed sequentially
in the order in which they appear.

We  will  allow  conditional  assignment  statements  such  as 
IF  (A(I).EQ.  0)  B(I)  =  3. 

They  are  easily  implemented  on  the  ILLIAC-IV  because  of  its  ability  to  turn 
off  individual  processors. 

The only other restriction to be made on the body of a DO SIM loop
is that a generation may not reference the same array element for two different
points in its index set $\mathcal{S}$. I.e., an assignment statement may not have
two different processors simultaneously storing values into a single memory
location. We do allow them to simultaneously load a value from a single
memory location, so this restriction is not made for uses of a variable.*


*Simultaneous loads from a single memory location are implemented in the
ILLIAC-IV by the ability of the central control unit to broadcast a value to all
processors.
II. Rewriting the Loop


Now consider the problem of rewriting the given loop (1) in the
form

(12)  DO α J1 = $\lambda^1$, $\mu^1$
        ...
      DO α Jk = $\lambda^k$, $\mu^k$
      DO α SIM FOR ALL (Jk+1, ..., Jn) ∈ $\mathcal{S}_{J^1, \ldots, J^k}$
        loop body
    α CONTINUE

This is the same as loop (4), except the DO CONC is replaced by a DO SIM.
We assume that the body of loop (1) contains no control transferring statements
- i.e., no GO TO or numerical IF statements.

Define the mappings J and $\pi$ as before. Any DO CONC statement
can be executed as a DO SIM, since it must give the same result if the
asynchronous processors happen to be synchronized. Thus, the rewriting
could be done just as before by finding a $\pi$ which satisfies C2. However,
the synchrony of the computation will allow us to weaken the condition C2.

Recall that rule C1 was made so that the rewriting will preserve
the order in which two different references are made to the same array
element. For references made during two different executions of the loop body,
the asynchrony of the processors requires that the order of those executions
be preserved. However, with synchronous processors, we can allow the
two loop body executions to be done simultaneously if the references will then
be made in the correct order. The order of these two references is determined
by the positions within the loop body of the occurrences which do the
referencing.

For two occurrences f and g, let f « g denote that the execution
of f precedes the execution of g within the loop body. This means that either
the statement containing f precedes the statement containing g, or else that
f is a use and g a generation in the same statement. The above observation
allows us to change rule C1 to the following weaker condition on $\pi$:

      For ... generation: if $T_f(P) = T_g(Q)$
      for $P, Q \in \mathcal{I}$ with P < Q, then we must have either
         (i) $\pi(P) < \pi(Q)$, or
         (ii) $\pi(P) = \pi(Q)$ and f « g.

In this rule, either (i) or (ii) is sufficient to ensure that occurrence f references
$T_f(P)$ for the point $P \in \mathcal{I}$ before g references the same element for $Q \in \mathcal{I}$.

The conditions can be rewritten in the following equivalent form:

      (i) $\pi(P) \le \pi(Q)$, and (ii) if $\pi(P) = \pi(Q)$ then f « g.

In the same way C2 was obtained from C1, the above rule gives
the following:

(S1)  For every variable and every ordered pair of
      occurrences f, g of that variable, at least one
      of which is a generation: for every
      X ∈ <f,g> with X > 0, we must have:
         (i) $\pi(X) \ge 0$, and
         (ii) if $\pi(X) = 0$, then f « g.

If $\pi$ satisfies rule S1, then it satisfies the preceding rule, so the rewritten
loop (12) is equivalent to the original loop (1).

We have been assuming that in rewriting the loop body, the order
of execution of the occurrences was not changed. I.e., the $I^1, \ldots, I^n$ were
replaced by expressions involving $J^1, \ldots, J^n$, but nothing else was done to
the loop body. Now let us consider changing the order of execution of the
occurrences.* That is, we may change the position of occurrences within the
loop body. For example, we may reorder the statements.

*Remember that there was no point in doing this before, since it couldn't
help for asynchronous processors.

Let f « g mean that f is executed before g in the rewritten loop
body (12). Then rule S1 guarantees that the correct temporal ordering of
references is maintained when the references were made in the original loop
during different executions of the loop body. Having changed the positions
of occurrences in rewriting the loop body, we now have to make sure that
any two references to the same array element made during a single execution
of the loop body are still made in the correct order. The following analogue
of rule C1 handles this:

      For ... generation: if $T_f(P) = T_g(P)$ for some $P \in \mathcal{I}$,
      and f precedes g in the original loop body, then f « g.

Rewriting this in terms of the sets <f,g> gives the following
rule.

(S2)  For every variable, and every ordered pair of
      occurrences f, g of that variable, at least one of
      which is a generation: if 0 ∈ <f,g> and f precedes g in
      the original loop body, then f « g.

Rules S1 and S2 guarantee that the rewritten loop (12) is equivalent to the
original loop (1). Note that rule S2 does not involve $\pi$.


III. The Coordinate Method


We could now try to solve the following problem: find a rewriting
of the loop body (and the resulting « relation between occurrences) and a
mapping $\pi$ which satisfy rules S1 and S2, and which minimize the expression
(11). This would give a rewriting of the loop which is optimal in the sense
that the outer DO $J^1$ / ... / DO $J^k$ loop has the fewest iterations. However,
the optimality of such a rewriting is illusory, for reasons which we will
now discuss.

The  ILLIAC-IV  has  64  processors.  The  feasibility  of  a  machine 
with  so  many  processors  is  achieved  by  having  all  processors  operate 
synchronously  with  a  single  control  unit,  and  by  allowing  each  processor  to 
access  only  its  own  separate  portion  of  memory.  If  processor  12  wants  to 
load  a  data  word  contained  in  processor  6's  part  of  memory,  then  the  follow¬ 
ing  sequence  of  instructions  is  executed  simultaneously  by  each  processor 
number  i,  for  i  =  0  to  63: 

(1)  load 

(2)  transmit data word to processor i + 6 (mod 64).
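
The effect of this two-step sequence can be pictured with a small Python sketch. It is only a model of the uniform routing described above, not of the actual ILLIAC-IV instruction set: the function `route` and the placeholder strings standing for data words are assumptions introduced for illustration.

```python
N = 64  # number of processors

def route(loaded, shift):
    # Every processor i sends its loaded word to processor (i + shift) mod N,
    # so processor j ends up holding the word loaded by processor (j - shift) mod N.
    received = [None] * N
    for i in range(N):
        received[(i + shift) % N] = loaded[i]
    return received

# Step (1): each processor loads a word from its own portion of memory.
loaded = [f"word{i}" for i in range(N)]
# Step (2): one uniform shift; going from processor 6 to processor 12 is a shift of 6.
received = route(loaded, 12 - 6)
print(received[12])   # word6
```

Every processor takes part in the same shift, so a single processor's request costs the same as 64 requests that happen to share one offset. This is why the storage layout has to match the access pattern.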

This means that the method of storing arrays must depend upon
how they are to be accessed. For example, consider the occurrence
F(J1 - 2 * J2 - J3, J3) inside the DO CONC FOR ALL (J2, J3), which we
generated before with the Hyperplane Theorem. It necessitates a complicated,
space-wasting format for storing the array F. The array would probably
have to be reformatted before and after execution of the outer DO J1 loop.*

*A precise statement of the rules relating storage allocation and DO SIMs is
contained in [4].

It appears that the best results are obtained by choosing a
mapping J which gives a loop with simple subscripting and a reasonable
amount of simultaneous computation. An obvious way of choosing such a J
is to let $J^1, \ldots, J^n$ be a permutation of the original index variables $I^1, \ldots, I^n$.
More precisely, the mapping $\pi: \mathbf{Z}^n \to \mathbf{Z}^k$ is taken to be a coordinate projection
- that is, a mapping for which $\pi[(a^1, \ldots, a^n)]$ is obtained by deleting n - k
coordinates from $(a^1, \ldots, a^n)$.

For example, for n = 5 we may want to rewrite loop (1) as

      DO α I3 = $\ell^3$, $u^3$
      DO α I4 = $\ell^4$, $u^4$
      DO α SIM FOR ALL (I1, I2, I5) ∈ {(x, y, z):
         $\ell^1 \le x \le u^1$, $\ell^2 \le y \le u^2$, and $\ell^5 \le z \le u^5$}

Then $\pi[(I^1, I^2, I^3, I^4, I^5)] = (I^3, I^4)$ and
$J[(I^1, \ldots, I^5)] = (I^3, I^4, I^1, I^2, I^5)$.
Notice that if $\pi$ is a coordinate projection, then the sets $\mathcal{S}_{J^1, \ldots, J^k}$ of
loop (12) are easy to compute.

The coordinate method consists of first choosing a coordinate
projection $\pi$, and then trying to find a rewriting of the loop body for which
S1 and S2 are satisfied. Since rewriting the loop body makes no difference
to condition (i) of S1, we must first require that it be satisfied for all
relevant occurrences f, g. Next, we apply S1 and S2 to get certain ordering
relations « between occurrences. We must then decide if it is possible
to rewrite the loop body so that these relations are satisfied.

In order to make this decision, we need a trivial observation:
a use in an assignment statement must precede the generation in that
statement. This observation will be given the status of a rule.

(S3)  For any use f and generation g
      in a single statement, we must have f « g.

Now we add the relations « given by S3 to those obtained from
S1 and S2. Next, we add all relations implied by transitivity. I.e., whenever
f « g and g « h, we must add the relation f « h.* If the resulting
ordering relations are consistent - that is, if we do not have f « f for any
occurrence f - then the loop body can be rewritten to satisfy the ordering
relations. We will describe the method of rewriting the loop body by an
example.


*  An  efficient  algorithm  for  doing  this  is  given  by  [  5] . 
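
The closure and consistency test can be sketched as follows. This is a deliberately naive illustration in Python, not the efficient algorithm of [5]; the function names are hypothetical, and the relation set is the one derived in the example of the next section (occurrence labels a1, a2, b1, b2, b3, c1, c2).

```python
def transitive_closure(relations):
    # Repeatedly add (f, h) whenever (f, g) and (g, h) are both present.
    closure = set(relations)
    changed = True
    while changed:
        changed = False
        for (f, g) in list(closure):
            for (g2, h) in list(closure):
                if g == g2 and (f, h) not in closure:
                    closure.add((f, h))
                    changed = True
    return closure

def is_consistent(closure):
    # The ordering is consistent when no occurrence must precede itself.
    return all(f != g for (f, g) in closure)

relations = {("a2", "a1"), ("b3", "b2"),    # from S1 (ii)
             ("b1", "b3"), ("c1", "c2"),    # from S2
             ("b1", "a1"), ("c1", "a1"),    # from S3
             ("b2", "c2"), ("a2", "b3")}    # from S3
closure = transitive_closure(relations)
implied = closure - relations
print(sorted(implied), is_consistent(closure))
```

For this relation set the closure adds exactly five pairs and no pair of the form (f, f), so the ordering is consistent. A pair of relations such as f « g and g « f would make the test fail.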


IV. An Example


Consider the following simple loop:

      DO 24 I1 = 2, 50
      DO 24 I2 = 1, 5
   21 A(I1, I2) = B(I1, I2) + C(I1)
        (a1)        (b1)       (c1)
   22 C(I1) = B(I1 - 1, I2)
       (c2)      (b2)
   23 B(I1, I2) = A(I1 + 1, I2) ** 2
        (b3)        (a2)
   24 CONTINUE

We want to rewrite it as a DO I2 / DO SIM FOR ALL I1 loop, so we apply the
coordinate method with the coordinate projection $\pi$ defined by $\pi[(I^1, I^2)] = I^2$.
We proceed as follows. (The calculations for steps 1-3 are shown in Table 2.)

1. Compute the relevant sets <f,g> for rules
   S1 and S2.

2. Check that S1 (i) is not violated.

3. Find the ordering relations given by S1 (ii) and S2.

4. Apply S3 to get the following relations:

      statement 21:  b1 « a1
                     c1 « a1
      statement 22:  b2 « c2
      statement 23:  a2 « b3

                           Is S1 (i)     Ordering Relations
   The Sets <f,g>          Violated?     S1 (ii)      S2

   <a1,a1> = {(0,0)}       NO            -            -
   <a1,a2> = {(-1,0)}      NO            -            -
   <a2,a1> = {(1,0)}       NO            a2 « a1      -
   <b3,b3> = {(0,0)}       NO            -            -
   <b1,b3> = {(0,0)}       NO            -            b1 « b3
   <b3,b1> = {(0,0)}       NO            -            -
   <b2,b3> = {(-1,0)}      NO            -            -
   <b3,b2> = {(1,0)}       NO            b3 « b2      -
   <c1,c1> = {(0,0)}       NO            -            -
   <c1,c2> = {(0,*)}       NO            -            c1 « c2
   <c2,c1> = {(0,*)}       NO            -            -

                           Table 2

5. Find all relations implied by transitivity:

      b3 « c2   [by b3 « b2 and b2 « c2]
      a2 « b2   [by a2 « b3 and b3 « b2]
      b1 « b2   [by b1 « b3 and b3 « b2]
      b1 « c2   [by b1 « b2 and b2 « c2]
      a2 « c2   [by a2 « b3 and b3 « c2]

6. Check that no relation of the form f « f was found in
   3 or 5.

7. Order the generations in any way which is consistent
   with the above relations - i.e., obeying b3 « c2.
   We let a1 « b3 « c2.

   We then write

   21 A(I1, I2) =
        (a1)
   23 B(I1, I2) =
        (b3)
   22 C(I1) =
       (c2)

8. Insert the uses in positions implied by the ordering
   relations (recall that a2 « a1):

                    A(I1 + 1, I2)
                        (a2)
   21 A(I1, I2) = B(I1, I2) + C(I1)
        (a1)        (b1)       (c1)
   23 B(I1, I2) =               ** 2
        (b3)
   22 C(I1) = B(I1 - 1, I2)
       (c2)      (b2)

9. Add any extra variables necessitated by uses no
   longer appearing in their original statements:

      X(I1 + 1, I2) = A(I1 + 1, I2)
                        (a2)
   21 A(I1, I2) = B(I1, I2) + C(I1)
        (a1)        (b1)       (c1)
   23 B(I1, I2) = X(I1 + 1, I2) ** 2
        (b3)
   22 C(I1) = B(I1 - 1, I2)
       (c2)      (b2)


This finally gives us the following rewriting of our original loop:

      DO 24 I2 = 1, 5
      DO 24 SIM FOR ALL I1 ∈ {x: 2 ≤ x ≤ 50}
      X(I1 + 1, I2) = A(I1 + 1, I2)
   21 A(I1, I2) = B(I1, I2) + C(I1)
   23 B(I1, I2) = X(I1 + 1, I2) ** 2
   22 C(I1) = B(I1 - 1, I2)
   24 CONTINUE
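
As a check on the result, the original and rewritten loops can be simulated side by side. The Python sketch below is illustrative only: the initial integer data, the dictionary representation of the arrays, and the helper `sim_assign` (which evaluates every right-hand side before storing any value, as the DO SIM requires) are all assumptions made for the example.

```python
def initial_state():
    # Arbitrary integer data; the bounds cover every subscript used below.
    A = {(i, j): 3 * i + j for i in range(1, 53) for j in range(1, 6)}
    B = {(i, j): i - 2 * j for i in range(1, 52) for j in range(1, 6)}
    C = {i: 7 * i for i in range(1, 52)}
    return A, B, C

def run_original():
    A, B, C = initial_state()
    for i1 in range(2, 51):
        for i2 in range(1, 6):
            A[i1, i2] = B[i1, i2] + C[i1]        # statement 21
            C[i1] = B[i1 - 1, i2]                # statement 22
            B[i1, i2] = A[i1 + 1, i2] ** 2       # statement 23
    return A, B, C

def sim_assign(target, index_of, rhs, points):
    # DO SIM assignment: evaluate all right-hand sides, then store them all.
    values = [(index_of(i), rhs(i)) for i in points]
    for key, v in values:
        target[key] = v

def run_rewritten():
    A, B, C = initial_state()
    X = {}
    points = range(2, 51)
    for i2 in range(1, 6):
        sim_assign(X, lambda i: (i + 1, i2), lambda i: A[i + 1, i2], points)
        sim_assign(A, lambda i: (i, i2), lambda i: B[i, i2] + C[i], points)     # 21
        sim_assign(B, lambda i: (i, i2), lambda i: X[i + 1, i2] ** 2, points)   # 23
        sim_assign(C, lambda i: i, lambda i: B[i - 1, i2], points)              # 22
    return A, B, C

print(run_original() == run_rewritten())   # True
```

Integer data keeps the comparison exact. Dropping the temporary X and reading A(I1 + 1, I2) directly in statement 23 would make the two results differ, since statement 21 would already have overwritten that element.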


V.  Further  Remarks 


It is easy to deduce a general algorithm for the coordinate
method from the preceding example. The method can be extended to
cover the case of an inconsistent ordering of the occurrences. In that
case, the loop can be broken into a sequence of sub-loops. Every generation
g for which the relation g « g does not hold can be executed within a
DO SIM loop. An algorithm for doing this is described in Chapter 4.

Observe that there are only $2^n - 1$ choices of a coordinate
projection $\pi$ for rewriting loop (1). It is easy to try them all, in decreasing
order of the amount of parallel computation achieved, until one is found
for which the rewriting is possible. Rule S1 should rapidly eliminate many
choices.
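
The enumeration can be sketched in Python as follows. The function name is hypothetical, and two assumptions are made: the count $2^n - 1$ is read as all subsets of the n coordinates except the full set (which would leave nothing for the DO SIM), and the number of DO SIM index variables is used as a crude proxy for the amount of parallel computation.

```python
from itertools import combinations

def coordinate_projections(n):
    # Yield the tuples of coordinates kept by pi (the outer, sequential DO
    # indices), fewest first: fewer sequential indices leave more index
    # variables for the DO SIM.
    for k in range(n):   # k = n is excluded: it leaves nothing for the DO SIM
        for kept in combinations(range(1, n + 1), k):
            yield kept

choices = list(coordinate_projections(5))
print(len(choices), choices[0], (3, 4) in choices)
```

For n = 5 this produces 31 candidates, among them the projection onto coordinates (3, 4) used in the earlier example. Each candidate would then be tested against rule S1 (i) before any ordering relations are computed.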

It  may  happen  that  the  rewriting  cannot  be  done  with  any 
coordinate  projection.  In  this  case,  a  more  general  linear  mapping  tt  must 
be  sought,  using  the  approach  developed  before  for  DO  CONC  loops.  For 
example,  no  coordinate  projection  works  for  the  relaxation  loop  (10). 
CHAPTER THREE


PRACTICAL CONSIDERATIONS


I. Restrictions on the Loop


Now consider the application of these methods to the problem
of compiling a FORTRAN program for execution on a multiprocessor computer.
We immediately observe that the restrictions which have been placed on the
loop (1) would eliminate most real FORTRAN DO loops from consideration. For
example, the DO limits $\ell^j$, $u^j$ are usually not all constants known at compile
time. Fortunately, most of the restrictions were made to simplify the exposition,
and are not essential. We will now describe the restrictions which are
essential to the analysis.

i 

First, some terms must be defined. By a "loop constant", we
mean an expression whose value does not change while the loop is executed -
i.e., any expression not involving generated variables or loop index variables.
A quantity is "known at compile time" if it has a constant value which can
be determined by the FORTRAN compiler.

i 

i  !  * 

The analysis can be applied to the following loop:

(13)  DO α I1 = $\ell^1$, $u^1$, $d^1$
        ...
      DO α In = $\ell^n$, $u^n$, $d^n$
        loop body
    α CONTINUE


assuming that it satisfies the following conditions:

1. Each $d^j$ is known at compile time.

2. The loop body contains no transfer of control to any
   statements outside it.

3. There is no I/O statement in the loop body.

4. For each subroutine or function call in the loop body, it is
   known which variable elements it can modify.

5. Each occurrence of a generated variable must be of the form
   VAR($e^1, \ldots, e^m$), with

   $$e^i = a^i_1 * I^1 + \cdots + a^i_n * I^n + c^i,$$

   where $c^i$ is a loop constant and each $a^i_j$ is known at compile
   time.

For the coordinate method, the following additional assumptions are required.

6. There is no transfer of control within the loop body.

7. For every generated variable VAR, each occurrence of VAR
   within the loop body must be of the form

   VAR($a^1 * I^{j_1} + c^1, \ldots, a^m * I^{j_m} + c^m$),

   where $c^i$ is a loop constant, $a^i$ = +1 or 0, and the $a^i$ and $j_i$ are
   the same for all occurrences of VAR.

By  weakening  the  restrictions,  many  complicated  details  are 
added  to  the  process  of  rewriting  the  loop.  However,  the  analysis  remains 
largely  unchanged.  Some  of  these  details  are  described  in  Chapter  4. 

A significant change is introduced by allowing occurrence mappings
of the form given in 5. It necessitates a complicated restating of the
Hyperplane and Plane Concurrency Theorems, as well as changing the method
of choosing the mapping $\pi$. This will be discussed in a future paper.

Note that if the loop (13) satisfies these restrictions, then so
does the inner DO $I^k$ / ... / DO $I^n$ loop, for any k. (The index variables
$I^j$ for j < k are loop constants for the inner loop.)


II.  Meeting  the  Restrictions 


Even  if  a  given  loop  does  not  satisfy  the  above  restrictions,  it 
may  be  possible  to  rewrite  it  so  that  it  does.  We  will  give  some  useful 
techniques  for  doing  this . 

It is easy to fulfill the requirement that the DO statements be
tightly nested. The method is illustrated by the following example. The loop

      DO 77 I1 = 1, 10
   16 A(I1, 1) = 0
      DO 77 I2 = 2, 20
      ...

can be rewritten as the following tightly nested loop:

      DO 77 I1 = 1, 10
      DO 77 I2 = 2, 20
   16 IF (I2 .EQ. 2) A(I1, I2 - 1) = 0
      ...

This technique is referred to as quantifying statement 16. It may be possible
to move the statement back outside the DO I2 loop and unquantify it after the
rewriting is performed.

Occurrence mappings can sometimes be rewritten by substituting
for generated variables so that condition 5 is met. One trick is illustrated
by the following example. Given

      K = N
      DO 6 I = 1, N
    5 B(I) = A(K)
    6 K = K - 1

we can rewrite it as

      DO 51 I = 1, N
    5 B(I) = A(N + 1 - I)
   51 CONTINUE
   61 K = 0

This  use  of  auxiliary  variables  to  effect  negative  incrementing  is  fairly 
common  in  FORTRAN  programs. 
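
The substitution can be checked with a short Python sketch. The function names and the sample data are assumptions for illustration. The closed form K = N + 1 - I holds on entry to statement 5 in iteration I, and K is left equal to N - N = 0 when the original loop finishes.

```python
def original(A, N):
    # A and B are indexed from 1; index 0 is unused padding.
    B = [None] * (N + 1)
    K = N
    for I in range(1, N + 1):
        B[I] = A[K]        # statement 5
        K = K - 1          # statement 6
    return B, K

def rewritten(A, N):
    B = [None] * (N + 1)
    for I in range(1, N + 1):
        B[I] = A[N + 1 - I]   # K replaced by its closed form N + 1 - I
    K = 0                     # value K holds when the original loop exits
    return B, K

A = [None, 10, 20, 30, 40]
print(original(A, 4) == rewritten(A, 4))   # True
```

After the substitution the loop body no longer generates the scalar K, so the subscript of A has the linear form required by condition 5.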

In  condition  6 ,  the  real  restriction  is  that  there  can  be  no 
possible  loops  inside  the  loop  body.  If  this  is  the  case,  then  transfer  of 
control  can  easily  be  eliminated  by  quantifying  assignment  statements  with 
logical  IFs. 
III. Scalar Variables


Even  though  the  loop  satisfies  all  the  restrictions,  it  is  clear 
that  these  methods  can  give  no  parallel  computation  if  there  are  generated 
scalar  variables.  Any  such  variable  must  be  eliminated. 

A common situation is for the variable to be just a temporary
storage word within a single execution of the loop body. The variable X in
the following loop is an example:

      DO 3 I = 1, 10
      X = SQRT(A(I))
      B(I) = X
    3 C(I) = EXP(X)

In  this  loop,  each  occurrence  of  X  can  be  replaced  by  XX(I) ,  where  XX  is  a 
new  variable . 
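
The effect of the replacement can be illustrated in Python. This is a sketch under assumed data: the function names and the contents of A are hypothetical, and a shuffled iteration order stands in for independent processors.

```python
import math
import random

A = [None] + [float(i) for i in range(1, 11)]   # A(1..10), index 0 unused

def with_scalar():
    B, C = [None] * 11, [None] * 11
    for I in range(1, 11):
        X = math.sqrt(A[I])
        B[I] = X
        C[I] = math.exp(X)
    return B, C

def with_expanded_array(order):
    B, C, XX = [None] * 11, [None] * 11, [None] * 11
    for I in order:                  # any order: iterations no longer share X
        XX[I] = math.sqrt(A[I])
        B[I] = XX[I]
        C[I] = math.exp(XX[I])
    return B, C

order = list(range(1, 11))
random.Random(0).shuffle(order)
print(with_scalar() == with_expanded_array(order))   # True
```

With the scalar X, every iteration writes the same storage word, which forces the iterations to run one after another. With XX(I), each iteration touches only its own element.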

In  general ,  we  want  to  replace  each  occurrence  of  the  scalar 
1  n 

by  VAR  (I  , . . . ,  I  ) ,  for  a  new  variable  VAR.*  A  simple  analysis  of  the  loop 
body's  flow  path  determines  if  this  is  possible. 

Another  common  situation  is  when  the  variable  X  appears  in  the 
loop  body  only  in  the  statement 

X  =  X  +  expression, 

where the expression does not involve X. This statement just forms the sum of the expression for all points in the index set $\mathcal{I}$. We can replace it by the statement

      VAR($I^1, \ldots, I^n$) = expression,

and add the following "statement" after the loop:

      $X = X + \sum_{(I^1, \ldots, I^n) \in \mathcal{I}} \mathrm{VAR}(I^1, \ldots, I^n)$.

The sum can be executed in parallel with a special subroutine.

The same approach applies when the variable is used in a similar way to compute the maximum or minimum value of an expression for all points in $\mathcal{I}$.

*After the rewriting, to save space we can lower the dimension of VAR by eliminating any subscript not containing a DO FOR ALL index variable.
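The rewriting of an accumulation can be sketched in Python as follows. The helper names are hypothetical, and the built-in `sum` merely stands in for the "special subroutine" that forms the sum. Integer data are used so that the order of summation cannot affect the result.

```python
from itertools import product


def accumulate_in_loop(x0, index_set, expression):
    """Original form: X = X + expression inside the loop body."""
    x = x0
    for point in index_set:
        x = x + expression(point)
    return x


def accumulate_after_loop(x0, index_set, expression):
    """Rewritten form: VAR(I1, ..., In) = expression, then one sum after the loop."""
    var = {}
    for point in index_set:            # iterations are now independent of one another
        var[point] = expression(point)
    return x0 + sum(var[point] for point in index_set)


index_set = list(product(range(1, 4), range(1, 6)))   # a 3-by-5 index set
f = lambda p: p[0] * p[1]
print(accumulate_in_loop(5, index_set, f), accumulate_after_loop(5, index_set, f))
```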
IV. Conclusion


We  have  presented  methods  for  obtaining  parallel  execution  of 
a  given  DO  loop.  Many  details  and  refinements  were  omitted  for  simplicity, 
but  all  the  basic  ideas  have  been  included.  Some  of  the  methods  are  being 
implemented  in  the  ILLIAC-IV  FORTRAN  compiler,  as  described  in  Chapter  4. 
Preliminary  study  indicates  that  they  will  yield  parallel  execution  for  a 
fairly  large  class  of  programs.  This  is  true  for  other  types  of  multiprocessor 
computers  as  well. 
CHAPTER FOUR


THE  ILLIAC  IV  TRANSLATOR 


As  a  practical  example  of  the  use  of  our  methods ,  we  describe 
algorithms  for  translating  a  FORTRAN  DO  loop  into  an  ILLIAC  IV  extended 
FORTRAN  DO  /  DO  FOR  ALL  loop.  We  thus  adopt  the  syntax  of  this  extended 
FORTRAN, as described in [4]. In particular, note that the "DO FOR ALL"
statement  is  just  our  "DO  SIM"  statement. 


I.  RESTRICTIONS 

We make some restrictions on the given loop (13) in addition to restrictions 1-7 of Section 3-I.

8.  No  two  variables  appearing  in  the  loop  may  be  EQUIVALENCEd 
in  any  way. 

9.  There  must  be  no  subroutine  calls  in  the  loop  body. 

10.  Functions  called  from  inside  the  loop  must  not  modify  the  value  of 
any  variable. 

11. For each j = 1, ..., n:

(a) $\ell^j$ and $u^j$ must be of the form

      $e + c_1 I^1 + \cdots + c_{j-1} I^{j-1}$,

where e is a loop constant, and the $c_i$ are known at compile time.

(b) For any P in the index set, $\ell^j(P)$ and $u^j(P)$ must be positive.

(c) If $|d^j| > 1$, then $\ell^j$ must be a loop constant.

Condition 11(b) means that for any $(P^1, \ldots, P^n)$ in the index set, the expressions $\ell^j$ and $u^j$ must be positive when $P^1, \ldots, P^{j-1}$ are substituted for $I^1, \ldots, I^{j-1}$.
II. NOTATIONS


We now introduce some notations. Assume that we are given the loop (13). Let D denote the element $(d^1, \ldots, d^n) \in \mathbf{Z}^n$. It is called the increment vector of the loop. For any point $X = (x^1, \ldots, x^n) \in \mathbf{Z}^n$, we define $X * D \in \mathbf{Z}^n$ by

      $X * D = (x^1 d^1, \ldots, x^n d^n)$.

As before, we let $\mathcal{I}$ denote the index set of this loop.

We define a way of representing certain subsets of $\mathbf{Z}^n$ which was partially introduced before. We let "+" denote "any positive integer", and let "-" denote "any negative integer". We then have

      $(3, +, 6, -) = \{(3, x, 6, y) : x > 0,\ y < 0\}$

      $(-2, 0^{\pm}, 0^{-}) = \{(-2, x, y) : y \le 0\}$.

Note that $0^{\pm}$ was previously denoted by *.

We let $\emptyset$ denote the empty set.

Now let x and y be objects of any kind, and $d \in \mathbf{Z}$. Then we define

      $\Phi(x, y; d) = \begin{cases} x & \text{if } d > 0 \\ y & \text{if } d < 0 \end{cases}$

      $\Lambda(x, y; d) = \begin{cases} y & \text{if } d > 0 \\ x & \text{if } d < 0 \end{cases}$

Thus $\Phi(I^1, I^1 + N; -3) = I^1 + N$, etc. Note that $\Lambda(x, y; d) = \Phi(y, x; d)$.
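The two selector functions are simple enough to state directly in code. The following Python sketch uses the illustrative names `phi` and `lam`, and represents symbolic limits as strings, since the arguments may be "objects of any kind".

```python
def phi(x, y, d):
    """First selector: x when d > 0, y when d < 0."""
    if d > 0:
        return x
    if d < 0:
        return y
    raise ValueError("the selector is not defined for d == 0")


def lam(x, y, d):
    """Second selector: the opposite choice, obtained by exchanging x and y."""
    return phi(y, x, d)


# The example from the text: a negative increment selects the second argument.
print(phi("I1", "I1 + N", -3))
print(lam(1, 95, 1))
```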
The reason for this notation is shown by the following trivial result.

Theorem: Let $a^i, a^i_*, a^{i*} \in \mathbf{Z}$, $i = 1, \ldots, n$, such that $a^i_* \le a^i \le a^{i*}$. Let T be the linear function defined by

      $T[(I^1, \ldots, I^n)] = h_1 I^1 + \cdots + h_n I^n$.

Then

      $\sum_i \Phi(a^i_*, a^{i*}; h_i)\, h_i \;\le\; T[(a^1, \ldots, a^n)] \;\le\; \sum_i \Lambda(a^i_*, a^{i*}; h_i)\, h_i$;

and these are the best possible bounds on $T[(a^1, \ldots, a^n)]$.

By "best possible bounds", we mean that given the $a^i_*$, $a^{i*}$, we can choose the $a^i$ so that $T[(a^1, \ldots, a^n)]$ equals either of the bounds.
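The theorem can be checked numerically by comparing the selector-based bounds with an exhaustive search over the box of admissible points. The Python sketch below is only an illustration. The function names are hypothetical, and terms with a zero coefficient are skipped as an assumption, since the selectors are not defined for d = 0 and such terms contribute nothing.

```python
from itertools import product


def phi(x, y, d):
    return x if d > 0 else y      # only called with d != 0


def lam(x, y, d):
    return y if d > 0 else x      # only called with d != 0


def linear_bounds(lower, upper, h):
    """Bounds on h1*a1 + ... + hn*an when lower[i] <= ai <= upper[i]."""
    terms = [(a, b, c) for a, b, c in zip(lower, upper, h) if c != 0]
    lo = sum(phi(a, b, c) * c for a, b, c in terms)
    hi = sum(lam(a, b, c) * c for a, b, c in terms)
    return lo, hi


def brute_force_bounds(lower, upper, h):
    """Minimum and maximum of the same linear function over every integer point."""
    ranges = [range(a, b + 1) for a, b in zip(lower, upper)]
    values = [sum(c * a for c, a in zip(h, point)) for point in product(*ranges)]
    return min(values), max(values)


lower, upper, h = [1, -2, 0], [4, 3, 5], [2, -3, 0]
print(linear_bounds(lower, upper, h), brute_force_bounds(lower, upper, h))
```

Both calls return the same pair, which illustrates that the bounds are attained and hence are the best possible.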

By the quantification of an occurrence, we mean the quantification of the statement containing the occurrence, as described in Section 3-II. We say that an occurrence is quantified by $I^j$ if it is quantified by the expression ($I^j$ .EQ. $\ell^j$) or ($I^j$ .EQ. $u^j$).
III. COMPUTING THE SETS < f, g >

Let f, g be occurrences of a variable, with occurrence mappings $T_f$, $T_g$. First, we must generalize our rules to cover the case of increments different from 1. An examination of the reasoning used in Section 1-V shows that this is accomplished by simply generalizing our definition of < f, g > to the following:

      < f, g > = $\{X \in \mathbf{Z}^n : T_f(P) = T_g(P + X * D)$ for some $P \in \mathbf{Z}^n\}$.

We can improve our results somewhat by considering smaller sets < f, g >. Namely, it is easy to see that everything we have done remains correct if we replace the set < f, g > by any set containing

      $\{X \in \mathbf{Z}^n : T_f(P) = T_g(P + X * D)$ for some $P,\ P + X * D \in \mathcal{I}$ such that f is actually executed for the point P and g is executed for the point $P + X * D\}$

as a subset.

We use this observation to obtain smaller sets < f, g > in the following two ways:

(i) By using quantifications of the occurrences to replace components of < f, g > by more restricted components.

(ii) By using the bounds on the index set to replace < f, g > by $\emptyset$ whenever possible.
As an example of (i), consider the following statements in the body of loop (13), with n = 2 and $d^1 = d^2 = 1$:

      IF (I1 .EQ. $\ell^1$) A(I2) = ...
      ...

Then we can take < a1, a2 > to be $(0^{+}, -1)$ instead of $(0^{\pm}, -1)$. This is because a1 is only executed for points $(\ell^1, y) \in \mathcal{I}$, and a2 is executed for points (w, z) with $w \ge \ell^1$.

As an example of (ii), consider the following loop:

      DO 22 I = 1, 10
   22 B(I) = B(I + 10)

We can let < b1, b2 > = $\emptyset$ rather than (-10), because P and P - 10 cannot both be in the index set for any P.


We now give an algorithm for computing < f, g >. The steps are discussed afterward.

1. Permute the subscript positions so that we have

      $T_f(I^1, \ldots, I^n) = (a^1 I^{k_1} + f^{k_1}, \ldots, a^r I^{k_r} + f^{k_r}, f_c^{r+1}, \ldots, f_c^m)$

      $T_g(I^1, \ldots, I^n) = (a^1 I^{k_1} + g^{k_1}, \ldots, a^r I^{k_r} + g^{k_r}, g_c^{r+1}, \ldots, g_c^m)$

   with $k_1 < \cdots < k_r$, and each $a^i = \pm 1$.*

   *A trivial modification to the algorithm handles the situation in which two of the $k_i$ are equal.

2. If for some i, $f_c^i - g_c^i$ is known at compile time to be not equal to zero, then < f, g > = $\emptyset$. Otherwise,

3. For j = 1, ..., n:

   (a) If $j = k_i$ for some i then:

       (i) If $f^j - g^j$ is known at compile time, then $x^j = (f^j - g^j) / (d^j \cdot a^i)$.

       (ii) Otherwise, $x^j = 0^{\pm}$.

   (b) If $j \ne k_i$ for all i, then define $x^j$ by the following table:

                                  g Quantified By:
       f Quantified By:     ($I^j$ .EQ. $\ell^j$)             ($I^j$ .EQ. $u^j$)             Neither

       ($I^j$ .EQ. $\ell^j$)      0                           + if $u^j \ne \ell^j$              $0^{+}$
                                                        $0^{+}$ if $u^j$ may equal $\ell^j$

       ($I^j$ .EQ. $u^j$)      - if $u^j \ne \ell^j$              0                           $0^{-}$
                            $0^{-}$ if $u^j$ may equal $\ell^j$

       Neither              $0^{-}$                        $0^{+}$                        $0^{\pm}$

4. If for some j, $x^j$ is a number which is not an integer, then < f, g > = $\emptyset$. Otherwise,

5. If for some j, $x^j$ is an integer such that $|d^j \cdot x^j| > |u^j - \ell^j|$, then < f, g > = $\emptyset$. Otherwise,

6. < f, g > = $(x^1, \ldots, x^n)$.
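The steps above can be sketched in Python under several simplifying assumptions, all of them illustrative rather than part of the original method's specification: the subscript lists are taken to be already aligned as in step 1; a subscript is encoded as `("index", j, a, c)` for $a I^j + c$ (with `c = None` when the offset is unknown at compile time) or `("const", c)`; the set components are encoded as integers or the strings `"+"`, `"-"`, `"0+"`, `"0-"`, `"0±"`; and `max_span[j]` is a known upper bound on $|u^j - \ell^j|$.

```python
from fractions import Fraction

ANY, NONNEG, NONPOS, POS, NEG = "0±", "0+", "0-", "+", "-"


def table_entry(qf, qg, distinct):
    """Step 3(b). qf, qg are 'l', 'u' or None; distinct means u != l is known."""
    if qf is not None and qf == qg:
        return 0
    if qf == "l":
        return (POS if distinct else NONNEG) if qg == "u" else NONNEG
    if qf == "u":
        return (NEG if distinct else NONPOS) if qg == "l" else NONPOS
    return {"l": NONPOS, "u": NONNEG, None: ANY}[qg]


def dependence_set(f, g, d, quant_f=None, quant_g=None, max_span=None, distinct=None):
    """Return the tuple (x1, ..., xn), or None for the empty set."""
    quant_f, quant_g = quant_f or {}, quant_g or {}
    max_span, distinct = max_span or {}, distinct or {}
    x = {}
    for sf, sg in zip(f, g):
        if sf[0] == "const":
            if sf[1] != sg[1]:                      # step 2: constants differ
                return None
            continue
        _, j, a, cf = sf
        cg = sg[3]
        if cf is None or cg is None:
            x[j] = ANY                              # step 3(a)(ii)
        else:
            x[j] = Fraction(cf - cg, d[j - 1] * a)  # step 3(a)(i)
    for j in range(1, len(d) + 1):
        if j not in x:                              # step 3(b)
            x[j] = table_entry(quant_f.get(j), quant_g.get(j), distinct.get(j, False))
        elif isinstance(x[j], Fraction):
            if x[j].denominator != 1:               # step 4: non-integral
                return None
            x[j] = int(x[j])
            if j in max_span and abs(d[j - 1] * x[j]) > max_span[j]:
                return None                         # step 5: outside the index set
    return tuple(x[j] for j in range(1, len(d) + 1))


# B(I) = B(I + 10) in a loop with I = 1, ..., 10, so |u - l| = 9.
b1, b2 = [("index", 1, 1, 0)], [("index", 1, 1, 10)]
print(dependence_set(b1, b2, [1]), dependence_set(b1, b2, [1], max_span={1: 9}))
```

The final line reproduces example (ii): without the bound on the index set the result is (-10), and with it the set is empty.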

We now discuss the individual steps of the algorithm.

1. This step is possible because of Restriction 7 of Section 3-I. It is only done to simplify the notation.

2. If $f_c^i \ne g_c^i$, then clearly $T_f(P) \ne T_g(Q)$ for all $P, Q \in \mathbf{Z}^n$.

3. (a) (i) If $x^j$ is an integer, it is just the j-th component of < f, g >. If it isn't an integer, then < f, g > = $\emptyset$.

       (ii) In this case, we know that either < f, g > = $\emptyset$, or its j-th component consists of an unknown integer. The best we can do is let the j-th component of < f, g > be $0^{\pm}$.

   (b) In this case, the j-th component of < f, g > is $0^{\pm}$. We use the quantification of f and g to refine this choice. In Subsection 4B, we will discuss how to decide whether or not $u^j$ may equal $\ell^j$.

4. This step tests for a non-integral $x^j$ produced in step 3(a)(i).

5. We will describe later a method for finding an upper bound for $|u^j - \ell^j|$.
IV. THE COORDINATE METHOD


We now present algorithms for applying the coordinate method, described in Chapter II, to our translation problem. We assume that the loop appearing in the original FORTRAN program has been rewritten using the techniques described in Chapter 3 to make it satisfy our restrictions.

The given loop (13) will be rewritten as

(14)  DO $\alpha$ $I^{p_1} = \lambda^{p_1}, \mu^{p_1}, d^{p_1}$
         ...
      DO $\alpha$ $I^{p_{n-k}} = \lambda^{p_{n-k}}, \mu^{p_{n-k}}, d^{p_{n-k}}$
      S = [ $(I^{j_1}, \ldots, I^{j_k})$ / $[\lambda^{j_1}, \lambda^{j_1} + d^{j_1} \ldots \mu^{j_1}]$ .CROSS. ...
            .CROSS. $[\lambda^{j_k}, \lambda^{j_k} + d^{j_k} \ldots \mu^{j_k}]$ : condition 1 .AND. ... .AND. condition m ]
      DO $\alpha$ FOR ALL $(I^{j_1}, \ldots, I^{j_k})$ / S,

where $p_1 < \cdots < p_{n-k}$ and $j_1 < \cdots < j_k$. (For notational convenience, we have replaced the k of Chapter 3 by n-k.) The conditions in the set expression are boolean expressions. "S" denotes some unique identifier.

The mapping $\pi : \mathbf{Z}^n \to \mathbf{Z}^{n-k}$ of Chapter II is thus defined by

      $\pi[(I^1, \ldots, I^n)] = (I^{p_1}, \ldots, I^{p_{n-k}})$.

The following data declarations must also appear:

      SET S ($\Lambda(\lambda^{j_1}, \mu^{j_1}; d^{j_1}), \ldots, \Lambda(\lambda^{j_k}, \mu^{j_k}; d^{j_k})$)
      ALLOCATE S ((1, 2, ..., k)).

(Since the $d^j$ are known at compile time, the $\Lambda$s are evaluated by the translator and do not actually appear in the translator's output.)


A. Rewriting the DO Statements

We first give an algorithm for finding the limits $\lambda^{j_i}$, $\mu^{j_i}$ and the conditions defining S in (14). Note that the $\lambda^{j_i}$, $\mu^{j_i}$ must be integer constants. We assume that for the original loop (13), we have

(15)  $\ell^j = \ell^j_0 + \ell^j_1 I^1 + \cdots + \ell^j_{j-1} I^{j-1}$

      $u^j = u^j_0 + u^j_1 I^1 + \cdots + u^j_{j-1} I^{j-1}$,

as described in Restriction 11 of Section I. The steps in the algorithm are explained afterwards.


The Algorithm

1. For j = 1, ..., n; compute the following loop constant expressions:

      $\hat{\ell}^j = \ell^j_0 + \sum_{i=1}^{j-1} \ell^j_i \, \Phi(\hat{\ell}^i, \hat{u}^i;\ d^i d^j \ell^j_i)$

      $\hat{u}^j = u^j_0 + \sum_{i=1}^{j-1} u^j_i \, \Lambda(\hat{\ell}^i, \hat{u}^i;\ d^i d^j u^j_i)$

      $\overline{(u - \ell)}^j = \Phi(1, -1; d^j) \, \bigl[ u^j_0 - \ell^j_0 + \sum_{i=1}^{j-1} (u^j_i - \ell^j_i) \, \Lambda(\hat{\ell}^i, \hat{u}^i;\ d^i d^j (u^j_i - \ell^j_i)) \bigr]$

      $\underline{(u - \ell)}^j = \Phi(1, -1; d^j) \, \bigl[ u^j_0 - \ell^j_0 + \sum_{i=1}^{j-1} (u^j_i - \ell^j_i) \, \Phi(\hat{\ell}^i, \hat{u}^i;\ d^i d^j (u^j_i - \ell^j_i)) \bigr]$

2. For j = 1, ..., n:

      $D^j$ = minimum { $a^s$ : There is an occurrence in the loop, not quantified by $I^j$, such that $I^j$ appears in the s-th subscript position of the occurrence mapping, for a variable whose dimensions are $(a^1, \ldots, a^m)$. }

   $D^j$ is undefined if this set is empty.

3. If $D^{j_i}$ is undefined for any i = 1, ..., k; then the rewriting is impossible.

4. For i = 1, ..., k:

   (a) $\Phi(\lambda^{j_i}, \mu^{j_i}; d^{j_i}) = 1$

       $\Lambda(\lambda^{j_i}, \mu^{j_i}; d^{j_i}) = \begin{cases} 1 + \overline{(u - \ell)}^{j_i} & \text{if this is known at compile time} \\ D^{j_i} & \text{otherwise} \end{cases}$

   (b) Throughout the original loop, and in (15), replace $I^{j_i}$ by $I^{j_i} + \hat{\ell}^{j_i} - \lambda^{j_i}$, $\ell^{j_i}_0$ by $\ell^{j_i}_0 - \hat{\ell}^{j_i} + \lambda^{j_i}$, and $u^{j_i}_0$ by $u^{j_i}_0 - \hat{\ell}^{j_i} + \lambda^{j_i}$. I.e., recompute the $\ell^{j_i}_0$ and $u^{j_i}_0$, but not the quantities computed in step 1.

   (c) If $\ell^{j_i}$ is not known at compile time, add the condition

          ($\ell^{j_i}$ $\Phi$(.LE., .GE.; $d^{j_i}$) $I^{j_i}$).

   (d) If $u^{j_i}$ is not known at compile time, add the condition

          ($u^{j_i}$ $\Phi$(.GE., .LE.; $d^{j_i}$) $I^{j_i}$).

       Otherwise, set $\mu^{j_i} = u^{j_i}$. (Note: This is the new value of $u^{j_i}$ computed in (b).)

5. For i = 1, ..., n - k:

   (a) Set $\lambda^{p_i}$ equal to $\ell^{p_i}$, with $\Phi(\lambda^{j_r}, \mu^{j_r};\ d^{j_r} d^{p_i} \ell^{p_i}_{j_r})$ substituted for each $I^{j_r}$, r = 1, ..., k.

   (b) Set $\mu^{p_i}$ equal to $u^{p_i}$, with $\Lambda(\lambda^{j_r}, \mu^{j_r};\ d^{j_r} d^{p_i} u^{p_i}_{j_r})$ substituted for each $I^{j_r}$.

   (c) For r = 1, ..., k:

       (i) If $\ell^{p_i}_{j_r} \ne 0$, $j_r < p_i$; then add the condition

              ($\ell^{p_i}$ $\Phi$(.LE., .GE.; $d^{p_i}$) $I^{p_i}$).

       (ii) If $u^{p_i}_{j_r} \ne 0$, $j_r < p_i$; then add the condition

              ($u^{p_i}$ $\Phi$(.GE., .LE.; $d^{p_i}$) $I^{p_i}$).

6. Let i be the smallest integer such that the conditions defining S do not depend upon the values of $I^{p_i}, I^{p_{i+1}}, \ldots, I^{p_{n-k}}$. If i exists, then place the assignment statement for S just before the DO $\alpha$ $I^{p_i}$ statement. Otherwise, place it after the DO $\alpha$ $I^{p_{n-k}}$, as it appears in (14).
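Step 1 of the algorithm can be sketched in Python as follows. This is an illustration under stated assumptions, not a definitive reading of the formulas: the third argument of each selector is taken to be the product of the two increments and the coefficient, a choice made so that the results agree with the values quoted in the worked example of Subsection D; the names `index_bounds`, `l_coef` and `u_coef` are hypothetical; and the symbolic limit N is replaced by the number 10.

```python
def phi(x, y, d):
    return x if d > 0 else y      # only called with d != 0


def lam(x, y, d):
    return y if d > 0 else x      # only called with d != 0


def index_bounds(l_coef, u_coef, d):
    """l_coef[j] = [l0, l1, ..., lj] are the coefficients of the (j+1)-st lower
    limit in (15), and likewise u_coef. Returns the extreme limits and the
    lower and upper bounds on |u - l| for every loop level."""
    l_hat, u_hat, span_lo, span_hi = [], [], [], []
    for j in range(len(d)):
        lo, hi = l_coef[j][0], u_coef[j][0]
        diff_lo = diff_hi = u_coef[j][0] - l_coef[j][0]
        for i in range(j):
            cl, cu = l_coef[j][i + 1], u_coef[j][i + 1]
            c = cu - cl
            if cl:
                lo += cl * phi(l_hat[i], u_hat[i], d[i] * d[j] * cl)
            if cu:
                hi += cu * lam(l_hat[i], u_hat[i], d[i] * d[j] * cu)
            if c:
                diff_lo += c * phi(l_hat[i], u_hat[i], d[i] * d[j] * c)
                diff_hi += c * lam(l_hat[i], u_hat[i], d[i] * d[j] * c)
        sign = phi(1, -1, d[j])
        l_hat.append(lo)
        u_hat.append(hi)
        span_lo.append(sign * diff_lo)
        span_hi.append(sign * diff_hi)
    return l_hat, u_hat, span_lo, span_hi


# DO I1 = 2, N ; DO I2 = I1, 1, -1   with N = 10
N = 10
print(index_bounds([[2], [0, 1]], [[N], [1, 0]], [1, -1]))
```

With N = 10 the call yields the extreme limits 2 and N for the outer loop, N and 1 for the inner loop, and the bounds N - 2, N - 2, 1 and N - 1 on the loop extents.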


Explanation of the Algorithm

1. If $d^j > 0$, then by the theorem of Section II, $\hat{\ell}^j \le I^j(P) \le \hat{u}^j$ for all $P \in \mathcal{I}$. If $d^j < 0$, then the inequalities are reversed.

   Similarly, $\underline{(u - \ell)}^j$ and $\overline{(u - \ell)}^j$ are chosen so that $\underline{(u - \ell)}^j \le |u^j(P) - \ell^j(P)| \le \overline{(u - \ell)}^j$ for all $P \in \mathcal{I}$. Although the $\underline{(u - \ell)}^j$ are not used in this algorithm, this is an appropriate place to calculate them. They will be used later.

2. The number $D^j$ provides an upper bound for $1 + |u^j(P) - \ell^j(P)|$, $P \in \mathcal{I}$. To see this, suppose the occurrence A($I^2$, $I^3$ + K) appears in the loop, and A is dimensioned by

      DIMENSION A(10, 33).

   If references occur to both A(i, j) and A(i', j'), then $|i - i'| \le 10 - 1 = 9$, $|j - j'| \le 32$. Hence,

      $|u^3(P) + K - (\ell^3(P) + K)| = |u^3(P) - \ell^3(P)| \le 33 - 1$.

3. $D^{j_i}$ is undefined if $I^{j_i}$ does not appear in any unquantified variable occurrences. If this is the case, then we have made a bad choice of DO FOR ALL variables (i.e., of the mapping $\pi$).

4. (a) Suppose $d^{j_i} > 0$. Then $\mu^{j_i} - \lambda^{j_i}$ is an upper bound for $u^{j_i}(P) - \ell^{j_i}(P)$, $P \in \mathcal{I}$. If $d^{j_i} < 0$, reverse the signs in the preceding statement. Note that $\lambda^{j_i}$ and $\mu^{j_i}$ are actual numbers, known at compile time.

   (b) We rewrite the loop so that the values assumed by $I^{j_i}$ are as small as we can make them and still be sure that they are positive.

       Note that if $\ell^{j_i}$ was originally a loop constant, then after the substitution it equals $\lambda^{j_i}$, a known number. This is necessary if $|d^{j_i}| > 1$, otherwise our set construction would not work right. I.e., the set $[\lambda^{j_i}, \lambda^{j_i} + d^{j_i} \ldots \mu^{j_i}]$ would not necessarily contain integers in the correct congruence class mod $d^{j_i}$. This is the reason for restriction 11(c) of Section I.

   (c), (d) If $\lambda^{j_i}$ or $\mu^{j_i}$ is not the actual value of $\ell^{j_i}$ or $u^{j_i}$ respectively, then the appropriate tests must be inserted.

5. If the limits on $I^{p_i}$ depend upon one or more $I^{j_r}$, then the bounds $\lambda^{j_r}$, $\mu^{j_r}$ on $I^{j_r}$ must be used to determine the new limits on $I^{p_i}$, and the original limit test must appear in the definition of S. Note that the expressions $\ell^{p_i}$, $u^{p_i}$ are the ones recomputed in Step 4(b).

6. This just removes loop-invariant code from the innermost DO loops.
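The dimension bound of Step 2 can be sketched in Python. The encoding is an assumption made for illustration: each occurrence is a triple of array name, a tuple giving the index variable that appears in each subscript position, and the set of index variables by which the occurrence is quantified. The data are those of the loop treated in Subsection D, whose arrays are dimensioned A(4, 95), B(97), C(99, 8), E(101).

```python
def dimension_bound(occurrences, dims, j):
    """Smallest declared dimension among the subscript positions that contain
    index variable j in an occurrence not quantified by j; None if undefined."""
    candidates = [dims[name][s]
                  for name, subscripts, quantified in occurrences
                  if j not in quantified
                  for s, index in enumerate(subscripts)
                  if index == j]
    return min(candidates) if candidates else None


dims = {"A": (4, 95), "B": (97,), "C": (99, 8), "E": (101,)}
occurrences = [
    ("B", (1,), {2}), ("A", (2, 1), {2}),      # statement 13, quantified by I2
    ("A", (2, 1), set()), ("C", (1, 2), set()),  # statement 14
    ("C", (1, 2), set()), ("A", (2, 1), set()), ("B", (1,), set()),  # statement 15
    ("E", (1,), {2}), ("C", (1, 2), {2}),      # statement 16, quantified by I2
]
print(dimension_bound(occurrences, dims, 1), dimension_bound(occurrences, dims, 2))
```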

B. The Bounds on $|u^j - \ell^j|$

The algorithm of Subsection A finds the bounds on $|u^j - \ell^j|$ needed for the algorithm of Section III which constructs the sets < f, g >.

For Step 3 of the algorithm of Section III, we can say that $u^j \ne \ell^j$ if and only if $\underline{(u - \ell)}^j$ is known at compile time, and is not equal to zero.

For Step 5 of that algorithm, we may replace $|u^j - \ell^j|$ by $\overline{(u - \ell)}^j$ if it is known at compile time, or by $D^j$ if it is defined. If neither replacement is possible, then the test may not be applied for that value of j.

Note that the values of $\underline{(u - \ell)}^j$ and $\overline{(u - \ell)}^j$ are not changed by the substitution for the $I^{j_i}$ performed in Step 4(b) of the algorithm of Subsection A.
C. The Coordinate Algorithm


We  now  describe  a  complete  algorithm  for  rewriting  the  given  loop 
(13)  as  a  DO  /  DO  FOR  ALL  loop  of  the  form  (14).  Actually,  we  will 
"un-tightly  nest"  the  loop  if  possible  by  moving  quantified  statements 
outside of inner loops. (See Section 3-II.) A discussion of the individual
steps  follows  the  description  of  the  algorithm. 

The  Algorithm 

1. Find the sets < f, g > required by rules S1 and S2, using the algorithm of Section III.

2. Choose the DO FOR ALL variables $I^{j_1}, \ldots, I^{j_k}$.

3. Apply the algorithm of Subsection A to rewrite the DO statements. If the rewriting is found to be impossible, stop.

4. Apply rules S1-S3 of Chapter 2 to obtain the ordering relations «. If the rewriting is found to be impossible, stop.

5. Replace the ordering « by its transitive closure. I.e., add the relations f « g which are implied by transitivity from the original ordering. Thus, if f « h « g, then add the relation f « g.

6. If for some generation g, the relation g « g holds, then go to the algorithm described in Subsection E.

7. Complete the ordering of the generations to a total ordering by the following procedures. Let f and g be generations which are unordered with respect to one another (neither f « g nor g « f holds). Then:

   (a) With the notation of (14): if, for some i, f is quantified by ($I^{p_i}$ .EQ. $\ell^{p_i}$) and g is not, then f « g.

   (b) If, for some i, f is quantified by ($I^{p_i}$ .EQ. $u^{p_i}$) and g is not, then g « f.

   (c) If the statements containing f and g are both of the form IF (expression), with the same "expression", then make f and g adjacent in the complete ordering. (I.e., if we make f « g, then f « h « g must imply h = f or h = g.)

   After applying (a), (b), and (c) for all pairs of unordered generations, complete the ordering by using the order in which the generations appear in the original loop.

   Note: No new ordering relations involving uses are added.

8. Denote the generations by $g_1, \ldots, g_m$, with $g_1$ « ... « $g_m$. Reorder the statements of the loop body so that the i-th statement is the one in which $g_i$ appears, for i = 1, ..., m - 1. (Note that by our restrictions, every loop body statement contains a generation.)

   For notational convenience, introduce an imaginary generation $g_0$ with $g_0$ « f for every occurrence f in the loop, and a corresponding Statement 0.
9. For each variable use f, let

      first (f) = maximum { i : $g_i$ « f }
      last (f) = minimum { i : f « $g_i$ }

   Note that if f appears in Statement j, then $0 \le$ first (f) < last (f) $\le j$.

10. For j = 1, ..., m:

   (a) For each use f appearing in Statement j:

       (i) If last (f) = j, leave f unchanged.

       (ii) If last (f) < j, then let $\bar{f}$ be a variable occurrence with the same occurrence mapping as f, but with a different, previously unused variable name. Replace f by $\bar{f}$ in statement j. Form an assignment statement $\bar{f}$ = f, with the same quantifications ($I^{p_i}$ .EQ. $\ell^{p_i}$ or $u^{p_i}$) as statement j. For example, if statement j is

              IF (($I^{p_2}$ .EQ. $\ell^{p_2}$) .AND. ($I^{j_1}$ .EQ. $\ell^{j_1}$)) ...,

            form the statement

              IF ($I^{p_2}$ .EQ. $\ell^{p_2}$) $\bar{f}$ = f.

            Insert this new statement anywhere between statements first (f) and last (f).

   (b) If Step (a)(ii) was applied to two distinct but identical uses $f_1$, $f_2$ (i.e., uses of the same variable having the same occurrence mappings), then if possible, replace $\bar{f}_1$ and $\bar{f}_2$ by a single occurrence $\bar{f}$, with a single generation.

11. (a) If the first q statements of the loop body (including those added in step (10)) are all quantified by ($I^{p_j}$ .EQ. $\ell^{p_j}$) for j = i, i + 1, ..., n - k, but the (q + 1)-st statement is not, then:

       (i) Move these q statements in front of the DO $\alpha$ $I^{p_i}$ statement.

       (ii) For j = i, ..., n - k: Delete the ($I^{p_j}$ .EQ. $\ell^{p_j}$) quantifiers, and replace all instances of $I^{p_j}$ by $\ell^{p_j}$ in these statements.

       (iii) Insert a

              DO $\beta$ FOR ALL ($I^{j_1}, \ldots, I^{j_k}$) / S

            statement in front of these q statements and a

              $\beta$ CONTINUE

            statement after them.

       (iv) If the assignment statement for S lies inside the DO $\alpha$ $I^{p_i}$ loop: then copy the assignment statement for S, except with $I^{p_j}$ replaced by $\ell^{p_j}$ for j = i, i + 1, ..., n - k.

       (v) Repeat (a).

   (b) If the last q statements of the loop body are all quantified by ($I^{p_j}$ .EQ. $u^{p_j}$) for j = i, ..., n - k, but the (q + 1)-st statement is not, then perform the analogue of steps (a)(i) - (v) to place these statements after the DO $\alpha$ $I^{p_i}$ loop.

I 

12. For each variable introduced in Step 10, and each variable introduced by the procedures of Section 3-III:

   (a) For each occurrence of the variable: Delete all subscript positions not referencing any $I^{j_i}$. E.g., replace VAR($I^{j_1}$ + 2, $I^{p_2}$, $I^{j_3}$, 4) by VAR($I^{j_1}$ + 2, $I^{j_3}$).

   (b) Create the appropriate data type and dimension statements. The actual dimensions are obtained by:

       (i) For the set S introduced in Step 3, and the variables introduced as described in Section 3-III, the i-th dimension is $\Lambda(\lambda^{j_i}, \mu^{j_i}; d^{j_i})$, where $\lambda^{j_i}$, $\mu^{j_i}$ were found in Step 3.

       (ii) For the variables introduced in Step 10, the dimensions are obtained in the obvious way from the dimensions of the variables which they replace.

   (c) Create the appropriate ALLOCATE command needed to make the DO FOR ALL valid.

   (d) Place the variable in an OVERLAP statement so as to allow it to occupy the same storage area as other variables similarly introduced in other loops.

13. For each original loop variable VAR which does not have the proper storage allocation for the DO FOR ALL, introduce a new variable $\overline{\mathrm{VAR}}$ of the same dimensions, but appropriate allocation. Add the statement

      $\overline{\mathrm{VAR}}$ = VAR

   in the front of the DO FOR ALL loop. If VAR is a generated variable, add

      VAR = $\overline{\mathrm{VAR}}$

   after the DO FOR ALL loop. Replace all instances of VAR in the loop by $\overline{\mathrm{VAR}}$.
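Steps 5, 6 and 9 lend themselves to a short Python sketch. It is offered as an illustration only: the closure is computed with Warshall's algorithm, which is a stand-in and not necessarily the algorithm of reference [5]; the function names are hypothetical; and the relations are those listed for the loop of Subsection D.

```python
def transitive_closure(relations):
    """Step 5: add every relation implied by transitivity (Warshall's algorithm)."""
    closure = set(relations)
    nodes = {x for pair in relations for x in pair}
    for k in nodes:
        for i in nodes:
            if (i, k) in closure:
                for j in nodes:
                    if (k, j) in closure:
                        closure.add((i, j))
    return closure


def first_last(use, generations, closure):
    """Step 9: generations are numbered 1..m; the imaginary generation 0
    precedes every occurrence, so 'first' defaults to 0."""
    before = [i for i, g in enumerate(generations, start=1) if (g, use) in closure]
    after = [i for i, g in enumerate(generations, start=1) if (use, g) in closure]
    return max(before, default=0), (min(after) if after else None)


relations = {
    ("A2", "A1"), ("A3", "A2"),                      # rule S1
    ("B1", "B2"), ("C1", "C2"), ("C2", "C3"),        # rule S2
    ("A1", "B1"), ("C1", "A2"), ("A3", "C2"),        # rule S3
    ("B2", "C2"), ("C3", "E1"),
}
closure = transitive_closure(relations)
generations = ["A2", "B1", "C2", "E1"]
print(any((g, g) in closure for g in generations))   # Step 6 test
print(first_last("A1", generations, closure), first_last("A3", generations, closure))
```

A use whose `last` value is smaller than the number of the statement containing it, such as A3 here, is one for which Step 10(a)(ii) introduces a temporary variable.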

Discussion  of  the  Algorithm 

2.  Some  heuristic  will  be  needed  for  choosing  the  DO  FOR  ALL 
variables,  probably  involving  the  sets  <  f,  g  >  computed  in 
Step  1. 

5.  See  [5]  for  an  algorithm  to  do  this. 

6.  If  g  «  g  holds  for  some  g ,  then  the  generation  g  cannot 
be  executed  inside  the  DO  FOR  ALL  loop.  However,  as  we 
will  see,  the  DO  FOR  ALL  may  be  applied  to  any  other 
generation  h  for  which  the  relation  h  «  h  does  not  appear. 

7. We need a total ordering of the statements so that we can rewrite the loop body. I.e., we need to know the order in which to write down the statements. This is done by totally ordering the generations. (Since there are no control statements, there is exactly one generation per statement.) If Step 5 did not provide a total ordering, we must add ordering relations until we get one.

   Steps (a) and (b) try to maximize the chances of being able to use Step 11. Step (c) tries to make the handling of mode sets as easy as possible for the compiler.

   If these steps do not determine a total ordering, using the order of appearance in the loop seems as good a method as any to complete the ordering.

   This entire step is not very precisely formulated, but should indicate how an algorithm for executing it can be obtained.

9.  The  use  f  must  be  executed  between  generation  numbers 
first  (f)  and  last  (f).  If  first  (f)  >  last  (f) ,  then  there  would 
be  a  generation  g  with  g  «  g. 

10. (a) (ii) In this case the use f must be executed before the value it produces is needed. I.e., the load must be executed before statement j is executed. Hence, a temporary storage location is needed.

    (b) The conditions for combining two temporary storage locations are clear, although when to combine IF clauses, etc., requires a detailed algorithm. Note that a necessary condition for being able to combine the assignment statements is that there exists an r such that

          first ($f_i$) $\le$ r < last ($f_i$)

        for i = 1, 2.

        Even if this is not possible, the same variable can sometimes be used, with two generations, thus saving storage space.

        A method of minimizing storage space which we have not considered is illustrated by the following example. Suppose we have

          A(I1) = B(I1) + C(I1)

        and (a)(ii) indicates that we need temporary variables for these uses of B and C. It may be possible to rewrite this as

          BC(I1) = B(I1) + C(I1)
             ...
          A(I1) = BC(I1).

        Developing an algorithm to do this is rather involved, and has not been done.

11. This is a fairly obvious procedure for moving statements outside of inner loops. It actually reverses the procedure for moving them into inner loops described in Section 3-II. As the example given in Subsection D shows, what gets moved inside a loop cannot always be moved back outside again.

   If (a) and (b) both move statements outside the same DO $\alpha$ $I^{p_i}$ loop, then there is no need to compute the same set S twice. However, a new set variable must be used in place of S in steps (iii) and (iv).

12. (a) This is an obvious space saving technique.

    (b) Our algorithm has neglected the possibility of overlapping the variables introduced in Steps 10 and 13 with each other. A trivial analysis shows when two such variables may be overlapped. Finding the optimal overlapping among a collection of such variables requires a clever algorithm, which has not been developed.

13. This is an obvious solution to the reallocation problem. In some cases it would be better to put the reallocation inside one or more of the outer DO $\alpha$ $I^{p_i}$ loops, and set $\overline{\mathrm{VAR}}$ equal to the appropriate "coordinate slice" of VAR. However, we have not devised an algorithm to do this.

D.  An  Example 

We  illustrate  the  preceding  algorithm  by  applying  it  to  the  following 
unlikely  loop: 
      DO 17 I1 = 2, N
      DO 17 I2 = I1, 1, -1
   13 IF (I2 .EQ. I1) B(I1) = A(I2, I1 - 1) + SIN (3.14 * I1 / 180)
                      (B1)    (A1)
   14 A(I2, I1) = C(I1, I2 + K) + .5
      (A2)        (C1)
   15 C(I1, I2) = A(I2, I1 + 1) * B(I1)
      (C2)        (A3)            (B2)
   16 IF (I2 .EQ. 1) E(I1) = C(I1, I2)
                     (E1)    (C3)
   17 CONTINUE

with the following dimension information:

      DIMENSION A(4, 95), B(97), C(99, 8), E(101)

Note that statements 13 and 16 were originally outside the inner DO I2 loop, but were moved inside it by the method of Section 3-II. We now apply the algorithm as follows:

1.  The  algorithm  of  Section  III  yields  the  following: 
      < A1, A2 > = (-1, 0)
      < A2, A2 > = (0, 0)
      < A2, A3 > = (-1, 0)
      < B1, B1 > = (0, 0)
      < B1, B2 > = (0, $0^{+}$)    [using step 3(b)]
      < C1, C2 > = (0, $0^{\pm}$)    [using step 3(a)(ii)]
      < C2, C3 > = (0, 0)
      < E1, E1 > = (0, 0).

2.  We  choose 

to  rewrite  the  loop  with  a  DO  FOR  ALL  I* ,  so  we 

let  J1  = 

1  and  p^  =  2.  Then 

"[(I1,  I2)]  =  J2  . 

3.  We  apply  the  algorithm  of  Subsection  A  as  follows: 

1.  A1  =  2,  u  1  =  N,  (u  -  1) 1  =  N  -  2,  (u  -  4)1  =  N  -  2 
A2  =  N,  u2  =  1,  (u_rJ)2  =  1  /  (U^l)2  =  N  -  1 

2.  D1  =  minimum  {  95,  97,  99,  101  }  =  95 

o 

D  =  minimum  {4,  8  }  =  4 

3.  D*  is  defined,  so  no  problem. 

4.  (a)  X1  =  1 

U1  =  95 

(b)  Substitute  I*  +  1  for  I"  throughout  the  loop,  and 
change  4* ,  u1  as  follows: 
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(c)  e*  known  at  compile  time,  no  condition 

(d)  Add  the  condition  ((N  -  1)  .GE.  1^)  . 


5.  (a)  X2  =  96  (=1  +uX) 

(b)  a2  =  1 

(c)  (i)  Add  the  condition  ((1  +  I1)  .GE.  I2) 

(ii)  No  condition 

6. The conditions defining S involve I2, so the assignment
   statement for S follows the DO 17 I2 statement.

Putting  this  all  together,  we  get  the  loop  control  statement 

DO  17  I2  =  96,  1,  -1 

S  =  [  I1  /  [1,  2  ...  95]  :  ((N  -  1)  .GE.  I1)  .AND. 

((I1  +  1)  .GE.  I2)  ] 

DO  17  FOR  ALL  I1  /  S 
and  the  new  loop  body: 

 13   IF (I2 .EQ. I1 + 1) B(I1 + 1) = A(I2, I1) + SIN (3.14 * (I1 + 1) / 180)
 14   A(I2, I1 + 1) = C(I1 + 1, I2 + K) + .5
 15   C(I1 + 1, I2) = A(I2, I1 + 2) * B(I1 + 1)
 16   IF (I2 .EQ. 1) E(I1 + 1) = C(I1 + 1, I2)

Note that none of the sets < f, g > computed in Step 1 are
changed by the substitution for I1.

We must also include the declarations

      SET S(95)
      DIMENSION Ā(95)

and add S, Ā to an OVERLAP statement.

4. The rules give the following relations:

      S1:  A2 « A1
           A3 « A2
           (and the rewriting is not impossible)

      S2:  B1 « B2
           C1 « C2
           C2 « C3

      S3:  A1 « B1
           C1 « A2
           A3 « C2
           B2 « C2
           C3 « E1

5. This gives the following additional ordering relations:

      A2 « B1   [from A2 « A1, A1 « B1]
      C2 « E1   [C2 « C3 « E1]
      B1 « E1   [B1 « B2 « C3 « E1]
      A2 « E1   [A2 « B1 « E1]
      B1 « C2   [B1 « B2 « C2]
6. g « g does not hold for any generation g.

7. The ordering of the generations obtained by 4 and 5 is already
   a total ordering. It is: A2 « B1 « C2 « E1.

8. g1 = A2, g2 = B1, g3 = C2, g4 = E1

This assigns the numbers 1-4 to the FORTRAN statements
as follows:

      14 → 1,   13 → 2,   15 → 3,   16 → 4.

9.    first (A1) = 1,   last (A1) = 2
      first (C1) = 0,   last (C1) = 1
      first (A3) = 0,   last (A3) = 1
      first (B2) = 2,   last (B2) = 3
      first (C3) = 3,   last (C3) = 4

10. (a) The only use for which (ii) holds is A3. We must
    change statement 15, which contains A3, to (recall that we
    are using the loop as rewritten in Step 3):

 15   C(I1 + 1, I2) = Ā(I2, I1 + 2) * B(I1 + 1)

    and add the following statement before the first state-
    ment of the loop body:

      Ā(I2, I1 + 2) = A(I2, I1 + 2)

    (b) Does not apply.

11. (a) We may remove statement 16 from the outer DO I2
    loop, with its own DO FOR ALL, as follows:

      S = [I1 / [1, 2 ... 95] : ((N - 1) .GE. I1) .AND.
          ((I1 + 1) .GE. 1)]
      DO 116 FOR ALL I1 / S
 16   E(I1 + 1) = C(I1 + 1, 1)
 116  CONTINUE

12. (a) We replace the two occurrences Ā(I2, I1 + 2) intro-
    duced in Step 10 by Ā(I1 + 2).

    (b) We need the following statements

      SET S(95)
      DIMENSION Ā(95)

    (c) Since we have only added 1-dimensional variables,
    no ALLOCATE statements need be added.

    (d) We will have an OVERLAP statement of the form

      OVERLAP ... (S, Ā) ...

13.  Assuming  default  allocations,  this  step  is  vacuous. 
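
Steps 4 through 8 above amount to forming the transitive closure of the relations given by the rules, checking that no occurrence precedes itself, and reading off the order of the generations. The following Python sketch repeats that calculation for the example. The function names are illustrative only and are not part of the algorithm as stated; the relations are copied from Step 4.

```python
# Relations f « g from Step 4 of the example, grouped by the rule that gave them.
RELATIONS = {
    ("A2", "A1"), ("A3", "A2"),                      # rule S1
    ("B1", "B2"), ("C1", "C2"), ("C2", "C3"),        # rule S2
    ("A1", "B1"), ("C1", "A2"), ("A3", "C2"),        # rule S3
    ("B2", "C2"), ("C3", "E1"),                      # rule S3 (continued)
}
GENERATIONS = ["A2", "B1", "C2", "E1"]


def transitive_closure(pairs):
    """Warshall-style closure of a relation given as a set of (f, g) pairs."""
    closure = set(pairs)
    nodes = {x for pair in pairs for x in pair}
    for k in nodes:
        for i in nodes:
            if (i, k) in closure:
                for j in nodes:
                    if (k, j) in closure:
                        closure.add((i, j))
    return closure


def is_consistent(closure):
    """Step 6: the ordering is consistent if no occurrence precedes itself."""
    return all(f != g for f, g in closure)


def order_generations(generations, closure):
    """Steps 7-8: sort generations by how many other generations precede them.

    This yields the total order only when the closure already orders every
    pair of generations, as it does in the example.
    """
    return sorted(generations,
                  key=lambda g: sum((f, g) in closure for f in generations))


CLOSURE = transitive_closure(RELATIONS)
print(is_consistent(CLOSURE))
print(order_generations(GENERATIONS, CLOSURE))
```

When the closure leaves some pair of generations unordered, the sort above is not sufficient and the remaining choices of Step 7 of the Coordinate Algorithm must be made.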

Putting this all together now, we finally get the following loop:

      DO 17 I2 = 96, 1, -1
      S = [I1 / [1, 2 ... 95] : ((N - 1) .GE. I1) .AND.
          ((I1 + 1) .GE. I2)]
      DO 17 FOR ALL I1 / S
      Ā(I1 + 2) = A(I2, I1 + 2)
 14   A(I2, I1 + 1) = C(I1 + 1, I2 + K) + .5
 13   IF (I2 .EQ. I1 + 1) B(I1 + 1) = A(I2, I1) + SIN (3.14 * (I1 + 1) / 180)
 15   C(I1 + 1, I2) = Ā(I1 + 2) * B(I1 + 1)
 17   CONTINUE
      S = [I1 / [1, 2 ... 95] : ((N - 1) .GE. I1) .AND.
          ((I1 + 1) .GE. 1)]
      DO 116 FOR ALL I1 / S
 16   E(I1 + 1) = C(I1 + 1, 1)
 116  CONTINUE
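
The equivalence of the original and rewritten loops can be checked by direct simulation. The Python sketch below is only an illustration under stated assumptions: each DO FOR ALL statement is modelled by evaluating all right-hand sides before storing any result, the new array of Step 10 is called `A_bar`, the argument of SIN in the rewritten loop is taken to be I1 + 1 (the substituted index), and N and K are restricted so that every subscript stays inside the declared dimensions.

```python
import copy
import math
import random


def make_arrays(seed=0):
    """Random 1-based arrays (index 0 unused), shaped like the DIMENSION statement."""
    rng = random.Random(seed)
    A = [[rng.random() for _ in range(96)] for _ in range(5)]    # A(4, 95)
    B = [rng.random() for _ in range(98)]                        # B(97)
    C = [[rng.random() for _ in range(9)] for _ in range(100)]   # C(99, 8)
    E = [rng.random() for _ in range(102)]                       # E(101)
    return A, B, C, E


def run_original(N, K, A, B, C, E):
    """The original sequential DO loop."""
    for i1 in range(2, N + 1):
        for i2 in range(i1, 0, -1):
            if i2 == i1:
                B[i1] = A[i2][i1 - 1] + math.sin(3.14 * i1 / 180)
            A[i2][i1] = C[i1][i2 + K] + 0.5
            C[i1][i2] = A[i2][i1 + 1] * B[i1]
            if i2 == 1:
                E[i1] = C[i1][i2]


def for_all(S, rhs, store):
    """One DO FOR ALL statement: evaluate every right side, then store."""
    values = [rhs(i1) for i1 in S]
    for i1, value in zip(S, values):
        store(i1, value)


def run_rewritten(N, K, A, B, C, E):
    """The rewritten loop, statement by statement."""
    A_bar = [0.0] * 98                      # hypothetical name for the new array
    for i2 in range(96, 0, -1):
        S = [i1 for i1 in range(1, 96) if N - 1 >= i1 and i1 + 1 >= i2]

        def save(i1, v): A_bar[i1 + 2] = v
        def put_a(i1, v): A[i2][i1 + 1] = v
        def put_b(i1, v): B[i1 + 1] = v
        def put_c(i1, v): C[i1 + 1][i2] = v

        for_all(S, lambda i1: A[i2][i1 + 2], save)
        for_all(S, lambda i1: C[i1 + 1][i2 + K] + 0.5, put_a)            # 14
        for_all([i1 for i1 in S if i2 == i1 + 1],                       # 13
                lambda i1: A[i2][i1] + math.sin(3.14 * (i1 + 1) / 180), put_b)
        for_all(S, lambda i1: A_bar[i1 + 2] * B[i1 + 1], put_c)          # 15
    S = [i1 for i1 in range(1, 96) if N - 1 >= i1 and i1 + 1 >= 1]

    def put_e(i1, v): E[i1 + 1] = v
    for_all(S, lambda i1: C[i1 + 1][1], put_e)                           # 16


def equivalent(N, K, seed=0):
    """True if both versions leave A, B, C and E in the same state."""
    assert 2 <= N <= 4 and 0 <= K <= 4, "keeps subscripts inside the declared bounds"
    first = make_arrays(seed)
    second = copy.deepcopy(first)
    run_original(N, K, *first)
    run_rewritten(N, K, *second)
    return first == second
```

Calling `equivalent(4, 1)` compares the two versions on random data. Removing the save into `A_bar` and reading `A[i2][i1 + 2]` directly in statement 15 makes the comparison fail, which is the situation Step 10 is designed to prevent.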

E. The Algorithm for Inconsistent Orderings

We now give an algorithm to handle the situation in which Step
4 or 5 of the Coordinate Algorithm of Subsection C produces an incon-
sistent ordering - i.e., one containing a relation of the form g « g.
The algorithm is discussed afterwards.

The Algorithm:

1. Define an equivalence relation ~ on the set of generations by
   f ~ g if and only if (i) f = g, or (ii) f « g « f.

   Let [f] denote the equivalence class of f. I.e.,

      [f] = { g : g ~ f } .

   Let 𝒢 denote the set of equivalence classes.

   Define an order relation on 𝒢 by [f] « [g] if and only if
   [f] ≠ [g] and f « g. (Note that this is independent of the
   choice of f ∈ [f], g ∈ [g].) This ordering is transitively
   closed (because the ordering of generations is) and is con-
   sistent. I.e., [g] « [g] does not hold for any generation g.

   Call the class [f] good if f « f does not hold, and bad if f « f.
   (Again, this is independent of the choice of f ∈ [f].)

2. Write 𝒢 as the disjoint union of subsets 𝒢_1, ..., 𝒢_m,
   where m is as small as possible, and the 𝒢_i satisfy:

   (i) The elements of 𝒢_i are either all good or all bad.

   (ii) If [f] ∈ 𝒢_i, [g] ∈ 𝒢_j and i < j, then [g] « [f]
        does not hold.

   Let G_i = { g : [g] ∈ 𝒢_i }. Call G_i good if for each
   g ∈ G_i, [g] is good, and call it bad if [g] is bad for each
   g ∈ G_i. Then (i) states that G_i is either good or bad. The
   minimality of m implies that if G_i is good, then G_{i-1} and
   G_{i+1} are bad.

3. Totally order the generations as follows:

   (a) If G_i is good, use Step 7 of the Coordinate Algorithm
       of Subsection C, minus parts (a) and (b), to order the
       elements of G_i.

   (b) If G_i is bad, order the elements of G_i by the order in
       which they appear in the original loop.

   (c) If f ∈ G_i, g ∈ G_j and i < j, then f « g. (Note that
       property (ii) of the 𝒢_i in Step 2 assures that this gives
       a consistent ordering.)

4. For each G_i, let Ḡ_i be the set of all occurrences f such that
   f appears in the same statement as g for some g ∈ G_i. If
   G_i is a bad set and g ∈ G_i, let
   Ĝ_i = { f : f a use and f ∈ Ḡ_i or f « g « f } .

5. For each bad set G_i: Delete all relations of the form f « g,
   f, g ∈ G_i, which were found by rule S1 but not by rules S2 or
   S3 (applied in Step 6 of the Coordinate Algorithm).

   All the remaining ordering relations f « g, including
   those defined in Step 3, form a consistent ordering of the
   occurrences.

6. For each bad set G_i and each use f ∈ Ĝ_i - Ḡ_i, introduce a
   new k-dimensional variable Ā and add the statement

      Ā(I^{j_1}, ..., I^{j_k}) = f .

   Let f̄ denote this generation of Ā. Add f̄ to G_i, add the
   ordering f̄ « f, and totally order G_i so that the new ordering
   (of all occurrences) is consistent. Finally, replace the
   original use f by Ā(I^{j_1}, ..., I^{j_k}).

7. Perform Steps 8, 9, 10, and 12 of the Coordinate Algorithm.
   Add the generations inserted by Step 10 to the appropriate
   sets G_i (maintaining 3(c)). Whenever possible, these new
   statements should be inserted so that their generations are
   included in good sets.

8. Let 𝒮_i be the statements containing the generations in G_i,
   ordered by the ordering between the generations.
9. Rewrite the loop as an outer DO I^{p_1}, ..., DO I^{p_{n-k}} loop con-
   taining the assignment statement for S (as constructed in Step 3
   of the Coordinate Algorithm of Subsection C), and a sequence
   of m inner loops as follows.

   For i = 1, ..., m:

   (a) If G_i is a good set, then insert a

          DO FOR ALL (I^{j_1}, ..., I^{j_k}) / S

       loop whose body consists of the statements 𝒮_i.

   (b) If G_i is a bad set, then insert the loop

          DO β I^{j_1} = ℓ^{j_1}, u^{j_1}, d^{j_1}
             ...
          DO β I^{j_k} = ℓ^{j_k}, u^{j_k}, d^{j_k}
          IF (.NOT. (condition 1 .AND. ... .AND. condition p)) GO TO β
          𝒮_i
       β  CONTINUE,

       where the conditions are those chosen in Step 5(c) of the
       algorithm of Subsection A (which was executed in Step 3 of the
       Coordinate Algorithm).

10. Perform Step 13 of the Coordinate Algorithm.
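
Step 1 of this algorithm is easy to mechanize once the ordering is available as a set of pairs. The following Python sketch forms the equivalence classes and marks each one good or bad. The generation names g1, ..., g4 and the sample relation are invented for illustration, and the grouping of classes into the sets of Step 2 is not attempted here.

```python
def transitive_closure(pairs):
    """Warshall-style closure of a relation given as a set of (f, g) pairs."""
    closure = set(pairs)
    nodes = {x for pair in pairs for x in pair}
    for k in nodes:
        for i in nodes:
            if (i, k) in closure:
                for j in nodes:
                    if (k, j) in closure:
                        closure.add((i, j))
    return closure


def classify(generations, pairs):
    """Return a list of (class, kind) with kind 'good' or 'bad'.

    Two generations are equivalent when they are equal or when each
    precedes the other; a class is bad when its members precede themselves.
    """
    rel = transitive_closure(pairs)
    classes, placed = [], set()
    for f in generations:
        if f in placed:
            continue
        members = frozenset(
            g for g in generations
            if g == f or ((f, g) in rel and (g, f) in rel)
        )
        placed |= members
        kind = "bad" if (f, f) in rel else "good"
        classes.append((members, kind))
    return classes


# Hypothetical ordering with one cycle: g2 « g3 « g2.
SAMPLE = {("g1", "g2"), ("g2", "g3"), ("g3", "g2"), ("g3", "g4")}
for members, kind in classify(["g1", "g2", "g3", "g4"], SAMPLE):
    print(sorted(members), kind)
```

In the sample, g2 and g3 form a single bad class lying between two good classes, so the rewritten loop would need a sequential inner loop for that class between two DO FOR ALL loops.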


Remarks  on  the  Algorithm 

The fact that the algorithm is correct follows from the following
observation:

The rewritten loop will be computationally equivalent to the
original one if for every relation f « g produced by the rules S1-S3,
one of the following holds:

(i) f appears in 𝒮_i, g appears in 𝒮_j with i < j.

(ii) f and g appear in the same DO FOR ALL loop, and f pre-
cedes g.

(iii) f and g appear in the same DO loop, in the same order
in which they appeared in the original loop.

(iv) f and g appear in the same DO loop, and f precedes g.

The only ones of the above conditions which are not obviously
sufficient are (iii) and (iv). For (iv), observe that the order of execution
of f and g only matters if 0 ∈ < f, g > and originally f « g or g « f.
However, if 0 ∈ < f, g > and originally g « f, then rule S2 produces
g « f, so the above conditions cannot hold for all f and g. Hence, if
0 ∈ < f, g >, then f « g and (iv) implies (iii).

For (iii), let J be the mapping we defined in Section 1-3, which
for the loop (14) is given by

      J[(I^1, ..., I^n)] = (I^{p_1}, ..., I^{p_{n-k}}, I^{j_1}, ..., I^{j_k}) .

Then the order of execution of references by f and g to the same variable
element is reversed only if for some A ∈ < f, g >, A > 0 and
J(A) < 0, or A < 0 and J(A) > 0. But this is precisely what rule S1
forbids.
For Step 6, note that f ∈ Ĝ_i and f ∈ Ḡ_j for j ≠ i implies i < j.
To see this, let g_i ∈ G_i, so f « g_i « f, and let g_j ∈ G_j be the generation
for the statement containing f, so f « g_j. Then g_i « f « g_j implies that
i < j. This means that the use of Ā which replaced the original use f
must follow the generation of Ā introduced in Step 6.
V. THE PLANE METHOD


A. Introduction

We now consider the application of the methods developed in
Chapter 1 to our translation problem. Since the ILLIAC extended
FORTRAN's DO FOR ALL is a DO SIM, by the observations of Section 2-II,
we can replace rule C1 by rule S1. (In doing so, we let « be the
ordering of occurrences in the given loop.) We then get the mappings
π and J to rewrite the given loop as a DO / DO FOR ALL loop as in
(4). Of course, in applying rule S1 to choose π, we use the sets
< f, g > as calculated in Section III of this chapter.

The only additional comment we have to make is about making
the optimal choice of π. Since the ℓ^i and u^i may not be known at com-
pile time, we do not know the values of the M^i for the expressions (9)
or (11) which we must try to minimize. For a practical approach, M^i
can be approximated in one of two ways:

(i) by a FREQUENCY statement if one appears in the
program, or

(ii) using the quantities λ^i and μ^i defined in Subsection
IV A.

Once we have obtained π and J, we still have the problem of re-
writing the loop as a legal extended FORTRAN DO FOR ALL loop. This
is always possible, but involves many practical details. Indeed, these
details may introduce enough inefficiency to offset the gain due to
parallel loop execution.

To  prevent  our  being  overwhelmed  by  details ,  we  will  restrict 
our  algorithm  to  the  hyperplane  case.  Moreover,  we  make  two  additional 
assumptions  beyond  those  needed  by  the  Hyperplane  Theorem: 

(i) Each d^j in the given loop (13) equals + 1.

(ii)  Each  generated  variable  has  no  missing  index 
variables. 

B.  Writing  the  DO  /  DO  FOR  ALL 

Assume that we have found the mapping π : Z^n → Z satisfying
S1 and we have constructed the index variables J^i by the algorithm of
Appendix A.* Let the I^i and J^i be related by

(16)  (a)   J^i = \sum_{j=1}^{n} h^i_j I^j

      (b)   I^i = \sum_{j=1}^{n} t^i_j J^j

for i = 1, ..., n.

We will rewrite the given loop in the form

(17)      DO α J^1 = λ^1, μ^1
          S = ...
          DO α FOR ALL (J^2, ..., J^n) / S
          loop body
       α  CONTINUE .

* This algorithm allows some choice in constructing the J^i. Subsection
E explains how to make the best choice.


We  must  therefore  construct  the  assignment  statement  for  S  and  the 

^  i  i  o  u 

limits  X  ,  a  .  The  set  S  wants  to  be  assigned  the  value  {  (p  ,  . . . ,  pn)  : 
(J*,  p2,  . .. ,  pn)  s  lP}.  This  is  accomplished  by  first  writing  the 
statement 


S  =  [  (J2,  ... ,  f)  /  [X2  ...  u2]  .CROSS . CROSS.  [Xn  ...  un] 

X  (A1  $(.LE.  ,  .GE.  ;  d1)  I1)  .AND.  (u1  4>(.GE.,  .LE.;  d1)  I1)  ... 


X 


.AND.  (un  *(.GE.,  .LE.  ;  dn)  Z  t"  Jj)]  . 


j 


j 


Next,  we  substitute  for  each  I  appearing  in  the  statement,  using 
Equation  (16-b).  (This  includes  any  appearances  of  I*  in  the  l)  and  u-*.) 
Finally,  we  have  to  choose  the  numbers  X  ,  u  such  that  for  every 
element  (p* ,  . . . ,  pn)  e  iP  ,  *X*  <  p*  <  u* ,  for  i  =  2 ,  . . . ,  n. 

The  choice  of  X  ,  u  is  made  by  the  following  procedure. 

1.  Use  the  algorithm  of  Subsection  4A  for  the  case  k  =  n,  to  find 
numbers  X^,  such  that  for  every  (p*,  . . . ,  pn)  e  tP  , 


p^  <c  [X^ ,  X^  +  d^  ...  u^]  ,  j  =  1,  ... ,  n. 

Note  that  this  may  involve  rewriting  the  loop  body.  However, 
since  it  does  not  change  any  of  the  sets  <  f,  g  >,  it  does 
not  affect  our  choice  of  tt  and  J, 
2. By the theorem of Section 2, and Equation (16-a), we can
let

X1  =  £  $(\j,  uj  ;  dj  hj)  hj 

j  3  3 

i?  =  E  A(Xj  ,  J  ;  dj  hj)  hj  . 
j  3  3 

We  also  define  X*,  u*  by 

X1  =  E  -Mi?,  u3  ;  d1  h1)  hj 

j  3 

u 1  =  E  A(t)  ,  u^;  d1  h.1 )  hj  , 

j  3  3 

where  the  0  ,  u3  are  defined  as  in  Subsection  4A  (for  the  loop  as  re¬ 
written  in  Step  1  above).  Note  that  X*  and  u3  are  loop  constants,  but 
may  be  unknown  at  compile  time. 
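
The limits chosen above rest on a simple fact: a linear form with integer coefficients attains its extreme values over a rectangular index range at the corners, taking the lower limit of each variable with a positive coefficient and the upper limit of each variable with a negative one. The Python sketch below illustrates only this fact; it assumes unit steps and lower limits not exceeding upper limits, and the function name and arguments are hypothetical rather than the notation of the text.

```python
def linear_form_bounds(h, lower, upper):
    """Smallest and largest values of sum(h[j] * I[j]) for lower[j] <= I[j] <= upper[j].

    h, lower and upper are equal-length sequences of integers.
    """
    smallest = sum(c * (lo if c >= 0 else hi) for c, lo, hi in zip(h, lower, upper))
    largest = sum(c * (hi if c >= 0 else lo) for c, lo, hi in zip(h, lower, upper))
    return smallest, largest


# The form 2*I1 + 3*I2 over 1 <= I1 <= 20, 1 <= I2 <= 30.
print(linear_form_bounds((2, 3), (1, 1), (20, 30)))

# A form with negative coefficients can have a negative lower bound.
print(linear_form_bounds((1, -1, -2), (1, 1, 1), (10, 10, 10)))
```

The second call shows that the lower limit of a new index variable need not be positive even when every original index starts at 1.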

Unfortunately, we have neglected to consider the fact that the
λ̂^i could be negative, thus making invalid the above set expression
defining S. In any event, it will make things easier for the compiler if
all the J^i vary from 1 to their upper limit. This is easily accomplished.
We first change (16-a) to

(18-a)   J^i = \sum_{j} h^i_j I^j - λ̂^i + 1 .

Defining the t^i by

         t^i = \sum_{j} t^i_j (λ̂^j - 1) ,

we derive the following inverse to (18-a):

(18-b)   I^i = \sum_{j} t^i_j J^j + t^i .

Letting μ^i = μ̃^i - λ̂^i + 1, we can now rewrite the DO statements
as:

      DO α J^1 = 1, μ^1
      S = [ (J^2, ..., J^n) / [1 ... μ^2] .CROSS. ... .CROSS. [1 ... μ^n] :
     X     (ℓ^1 Φ(.LE., .GE.; d^1) I^1) .AND. ...
     X     .AND. (u^n Φ(.GE., .LE.; d^n) I^n) ]
      DO α FOR ALL (J^2, ..., J^n) / S

where (18-b) is used to remove all instances of the I^i from the above
set expression.

We should also add an algorithm to remove redundant conditions
in the set expression for S. This is easily done for some types of con-
ditions: e.g., (2 .LE. J^3). However, we have not attempted to write
a general algorithm for this.

C.  Reformatting  the  Variables 

Having rewritten the DO statements using the new index
variables J^1, ..., J^n, the obvious next step is to use Equation (18-b)
to substitute for the I^i in the loop body. However, this will not
usually produce a legal DO FOR ALL loop. To illustrate this,
suppose n = 3 and I and J are related by

      I^1 = J^1 + J^2 + 2 J^3          J^1 = I^1 - I^2 - 2 I^3
      I^2 = J^2                        J^2 = I^2
      I^3 = J^3                        J^3 = I^3

Then the variable occurrence A(I^2, I^1, I^3) would be rewritten as
A(J^2, J^1 + J^2 + 2 J^3, J^3), which may not appear inside a DO FOR ALL
(J^2, J^3) loop.
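
The two sets of equations in this example are inverse to one another, which can be confirmed mechanically. The Python sketch below writes them as integer matrices (the names `H` and `T` are chosen here to echo the coefficients h and t of Equation (16), and the helper functions are illustrative) and shows the subscripts that result from direct substitution.

```python
# Rows give J^1, J^2, J^3 in terms of (I^1, I^2, I^3).
H = [[1, -1, -2],
     [0, 1, 0],
     [0, 0, 1]]

# Rows give I^1, I^2, I^3 in terms of (J^1, J^2, J^3).
T = [[1, 1, 2],
     [0, 1, 0],
     [0, 0, 1]]


def apply(matrix, vector):
    """Multiply an integer matrix by an index vector."""
    return tuple(sum(c * v for c, v in zip(row, vector)) for row in matrix)


def compose(X, Y):
    """Matrix product X * Y for lists of rows."""
    columns = list(zip(*Y))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in X]


def rewritten_subscripts(J):
    """Subscripts of the occurrence A(I^2, I^1, I^3) for a given (J^1, J^2, J^3)."""
    i1, i2, i3 = apply(T, J)
    return (i2, i1, i3)


print(compose(H, T))                    # the identity matrix
print(rewritten_subscripts((3, 4, 5)))  # the middle subscript mixes all three J's
```

The middle subscript depends on J^2 and J^3 as well as J^1, which is why the substituted occurrence is not acceptable inside a DO FOR ALL on (J^2, J^3).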

We will solve this problem by introducing a new variable Ã
related to A by

      Ã(J^1, J^2, J^3) = A(I^2, I^1, I^3) .

This is done as follows:

(i) Introduce the loop

          DO β CONC (I^2, I^1, I^3) / range of A
       β  Ã(I^1 - I^2 - 2 * I^3, I^2, I^3) = A(I^2, I^1, I^3)

    before the main loop.

(ii) Replace all occurrences of A by the appropriate
     occurrence of Ã.

(iii) If A was a generated variable, introduce

          DO γ CONC (I^2, I^1, I^3) / range of A
       γ  A(I^2, I^1, I^3) = Ã(I^1 - I^2 - 2 * I^3, I^2, I^3)

     after the main loop.
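
The copying loops of (i) and (iii) can be sketched in Python as follows. This is an illustration only: arrays are represented as dictionaries keyed by subscript tuples, so that no shift of the subscripts to start at 1 is needed, and the function names are hypothetical.

```python
def reformat_in(A):
    """Loop (i): build the new array from A, which is keyed by (I2, I1, I3)."""
    return {(i1 - i2 - 2 * i3, i2, i3): value
            for (i2, i1, i3), value in A.items()}


def reformat_out(A_tilde):
    """Loop (iii): copy the new array, keyed by (J1, J2, J3), back into A's layout."""
    return {(j2, j1 + j2 + 2 * j3, j3): value
            for (j1, j2, j3), value in A_tilde.items()}


def sample_array(d2=2, d1=3, d3=2):
    """A small test array whose element values encode their own subscripts."""
    return {(i2, i1, i3): 100 * i2 + 10 * i1 + i3
            for i2 in range(1, d2 + 1)
            for i1 in range(1, d1 + 1)
            for i3 in range(1, d3 + 1)}


A = sample_array()
A_TILDE = reformat_in(A)
print(len(A_TILDE) == len(A))       # no two elements collide
print(reformat_out(A_TILDE) == A)   # copying back restores A
```

Since the change of index variables is invertible, no two elements of A are sent to the same element of the new array, and copying back after the main loop restores the original layout.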

Subsection  E  will  consider  the  problem  of  rewriting  these  DO  CONC 

loops  as  legal  DO  FOR  ALL  loops. 

We note here that the only time a legal DO FOR ALL will be
formed by simply substituting for the I^i inside the loop body is when
π is a coordinate projection - i.e., when J^1 = I^k for some k. In
this case, each J^i may be chosen to be one of the I^j, so the rewriting
of the loop body can be done by the procedures of the Coordinate
Algorithm. For the rest of our discussion, we assume that this is
not the case.

We now generalize the method we applied above for the variable
A to the following algorithm for reformatting* variables.

1. For notational convenience, assume that each occurrence f
   of a variable VAR is of the form

      VAR(I^{j_1} + f^1, ..., I^{j_k} + f^k, f_c^1, ..., f_c^m) ,

   where j_1 < ... < j_k and the f^i, f_c^i are loop constants.
   (Of course, k and m depend upon the variable, but not
   upon the particular occurrence of that variable.)

2. For each non-scalar variable VAR appearing in the loop such
   that 0 < k < n:**

   (a) Introduce a new (n + m)-dimensional variable
       VARX, whose j-th dimension is defined to be equal to:

       (i) The i-th dimension of VAR, if j = j_i.

       (ii) The (j - n + k)-th dimension of VAR if
            j > n.

* Do not confuse reformatting with the reallocation done in Step 13 of
the Coordinate Algorithm of Subsection IV C.

** Here, and in Step 3, k and m are as defined in Step 1, for the par-
ticular variable under discussion.
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(iii)  j  uj  -  Xj  |  +  1  otherwise,  where  iJ ,  X^ 
are  the  numbers  found  in  Subsection  B  (in 
Step  1  of  the  procedure  for  choosing  the 


(b)-  Insert  the  following  loop  before  the  main  loop: 

DO  B  CONC  FOR  (I1 ,  . . . ,  In+m)  /  range  of  VARX 

p  VARX  (  z1,  ?n+m)  =  VAR  (l1 . A', 

.n+1  .n+m> 

1  /•••/!  )  , 

- 

-  $  (X^ ,  u^;  d^  +  1  if  j  =  for  some  i 

where  =  S 


Ij  otherwise . 

. 

(c)  Replace  each  occurrence  f  of  VAR  in  the  loop  by 

VARX(e*  ,  . . . ,  en,  f*  ,  . . . ,  fm ) ,  where 

c  c 

F*  +  f^  if  j  =  ji  for  some  i 


< 


1^  otherwise. 

» 

3. For each non-scalar variable VAR appearing in the loop (as
   rewritten in Step 2) such that k = n:

   (a) Introduce a new (n + m)-dimensional variable ṼAR. If
       the dimensions of VAR are δ^1, ..., δ^{n+m}, then the
       dimensions δ̃^1, ..., δ̃^{n+m} of ṼAR are defined as follows.
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For  i  =  1 ,  . . . ,  n;  let 


h1 

?  =  £  A(1,6i;hJ)  h}  . 

) 


h  1  -  61  +  1  if  i  <  n 


Tnen'6i  =  < 


6  ifn<i<n+m. 

■ 

(b)  Insert  the  following  loop  before  the  main  loop: 
DO  8  CONC  FOR  (I1 ,  . . . ,  ln+m)  '/  range  of  VAR 


8  VAR  (E  h1  I3  -  6  ‘ 1  +  1 ,  .  „ , ,  I  h"  IJ  -  &“  +  1 , 
3  1  3  5  . 


n  Tj  _  Rn 


Tn+1 


i“Ti . in+tn)  =  var(i* ,  rri“) . 

t 

(c)  Replace  each  occurrence  f  of  VAR  in  the  loop 

by  . 

VAR  (J1  +T1  -  61  +  E  h*  fj,  ... , 

j  ] 

Jn+tn-6n+E  h"f<,  f‘  fm).  . 

j  3  c 

(d)  If  VAR  was  a  generated  variable,  add  the  following 

i 

loop  after  the  main  loop:  *  r 

DO  8  CONC  FOR  (I1 . In+m)  /  range  of  VAR 

p  VAk  (I  ,  ...»  I  )  - 

VAR  (  E  h1  Ij  -  5 1  +  1 ,  . . . ,  E  h*  Ij  -  £n  +  1 , 

3  ]  3  J 

,n+l  Tmv 

I,  .«•!)•  ■  i 


Tn+m. 
D. Discussion of Reformatting

We begin the discussion with an explanation of the preceding
algorithm.

1. This definition of the f^i and f_c^i for the occurrence f is used
   in Steps 2 and 3, as is the definition of k and m for the
   variable of that occurrence. It is well to remember that the
   loop constants f_c^j may be functions of an index variable of
   some outer loop containing the given loop (13).

2. If any of the I^i are missing from the occurrences of VAR, it
   is necessary to put them in so that we may apply Step 3.
   This is primarily because the DO FOR ALL syntax requires
   all occurrences to involve the entire multi-index
   (J^2, ..., J^n). Also, it will be impossible to represent
   (I^{j_1}, ..., I^{j_k}) in terms of k of the J^i.

   Note that the DO CONC loop of (b) is easy to translate
   into DO / DO FOR ALL loops.

   The remarks we make for Step 3 about the last m dimensions
   of VAR apply to this step as well.

3. Here, we replace VAR by the variable ṼAR which is related
   to it by

      ṼAR(J^1, ..., J^n, I^{n+1}, ..., I^{n+m}) = VAR(I^1, ..., I^{n+m}) .

   The correctness of the dimensions follows from the
   theorem of Section II. I.e., the subscript values for the
   occurrence of ṼAR in (b) range from 1 through the δ̃^i.

Note that we have reformatted the entire (n+m)-dimensional array
VAR, even though the loop may only reference one n-dimensional
slice, and can reference at most one slice for each occurrence of VAR
in the loop. To see why this may be necessary, consider the two
occurrences A(I^1, I^2, K) and A(I^1, I^2, L), L and K unknown loop
constants. Since these two occurrences may reference the same
slice, or different slices, we must reformat all of A. However, if all
occurrences of A are of the form A(..., ..., K), then we need
only reformat the single I^3 = K slice.

The generalization of our algorithm to one which only reformats
the part of the array that may be referenced is straightforward, but
is tedious to write in full detail. Note, however, that it is
influenced by the placement of the reformatting loop, which is dis-
cussed below.

The efficient translation of the DO CONCs produced by (b)
and (d) may be difficult. It will be discussed at length in Subsection
E.

Note that 3(b) need not be done if all uses of VAR reference
only values which are generated in the loop.

Since reformatting must be done for essentially all non-
scalar variables in the loop, it represents a large "overhead cost"
for the Hyperplane Method. However, the loop (13) will often be
contained in a larger loop (either a programmed loop, or another DO
loop). The reformatting can then be moved outside the loop. Any
occurrences of the variable VAR after the reformatting loop of
Step 3(b), and before that of Step 3(d), can be replaced by occurrences
of ṼAR. The required occurrence mappings are easily derived from
the formula in Step 3(b).

The  problem  of  finding  the  optimal  location  for  the  reformatting 
loops  is  very  similar  to  the  problems  encountered  in  compiler 
optimization. 

E.  Translating  the  DO  CONCs 

We now consider the problem of translating the DO CONCs
produced in Steps 3(b) and (d) above. Since a loop involving a
DO FOR ALL will be much more efficient than an ordinary sequential
DO loop, we will almost always want to use a DO FOR ALL. However,
consider the following loop which might be produced in Step 3(b):

(19)      DO β CONC (I^1, I^2) / range of A
       β  Ã(2 * I^1 + 3 * I^2 - 2, I^1 + I^2) = A(I^1, I^2) .

Because of the complicated subscripting of Ã, this cannot be
translated into a legal DO FOR ALL on I^1 or I^2 or (I^1, I^2).

In such a case, the reformatting is done in stages as follows:
Let J_0, J_1, ..., J_τ be the sequence of variables constructed in
Appendix A, where J_0^i = I^i and J_τ^i = J^i. The DO CONC
loop generated in Step 3(b) of Subsection B is replaced by a sequence
of the following loops, for σ = 0, ..., τ - 1:

(20)      DO β CONC FOR (J_σ^1, ..., J_σ^n) / range of VAR_σ
       β  VAR_{σ+1}(J_{σ+1}^1 - \underline{δ}_{σ+1}^1 + 1, ...,
                    J_{σ+1}^n - \underline{δ}_{σ+1}^n + 1)
             = VAR_σ(J_σ^1, ..., J_σ^n)

where the \underline{δ}_{σ+1}^i are computed as in Step 3(a), and the
J_{σ+1}^i are replaced by their values given in Equation (A-2) of
Appendix A.* The VAR_σ are new variables which we have introduced.

From Equation (A-2), we see that the DO CONC loop (20) can
always be rewritten with a DO FOR ALL J_σ^p. Thus, if there is more
than one choice for the index p, we should choose the one for which
J_σ^p is the best DO FOR ALL variable. We can add this to the
algorithm of Appendix A for choosing J.

Suppose  that  for  some  particular  a  and  some  r,  the  algorithm  of 
Appendix  A  yields  q^5  =  0.  This  then  implies  that  qj?  =  0  for  all 
p  >  q,  and  f  =  £  =  (neglecting  the  permutation  mentioned  in  the 
footnote).  Hence,  we  can  replace  Ja+^  by  JT  (and  VARo+1  by  VAR^)  in 

(20)  and  get  a  DO  CONC  which  can  be  translated  into  a  DO  FOR  ALL 
*  7  7 

Recall  that  for  o  +  1  =  t,  so  Ja+j  =  J  ,  an  additional,  permutation  of 
the  superscripts  may  be  required . 
loop. This eliminates the last τ - (σ + 1) stages of the reformatting.
For convenience of notation, set τ = σ + 1 in this case.

We thus have a sequence of DO CONC loops (20) for the
variable VAR. The construction of this sequence is essentially
independent of the variable. I.e., the process is really the con-
struction of the J_σ. Hence, it is only done once for all the variables.

The dimensions of the array VAR_0 are the same as those of VAR.
The dimensions of VAR_1, ..., VAR_τ are determined sequentially by
the same method used in Step 3(a) of Subsection C. The dimensions
of VAR_τ will then equal those of ṼAR.

The storage allocations for the VAR_σ are made so as to permit
the DO FOR ALL in the translation of (20). If the storage allocation
for VAR agrees or can be made to agree with the allocation of VAR_0,
then VAR is substituted for VAR_0 in (20). Otherwise, the assign-
ment statement

      VAR_0 = VAR

must precede the reformatting. Similarly, the allocation of ṼAR
must allow the DO FOR ALL on (J^2, ..., J^n). If this is consistent
with the storage allocation required by VAR_τ, then replace VAR_τ
with ṼAR. Otherwise, the statement

      ṼAR = VAR_τ

must follow the loops (20).

The translation of the DO CONC loops generated by Step 3(d) of
Subsection B is obtained by the obvious reversal of the loops con-
structed above.

The above procedure sounds very long and costly when des-
cribed abstractly. Observe, though, that τ = 1 corresponds to the
case in which the DO CONC loops of 3(b) and (d) may be written
immediately with a DO FOR ALL (except, perhaps, for storage
reallocation). In general, as mentioned in Appendix A, we have
τ ≤ minimum { |h_j| : h_j ≠ 0 }. In practice, τ will usually be
small, and will often equal 1.

To illustrate this procedure, consider the DO CONC of (19)
which was obtained from the π defined by π[(I^1, I^2)] = 2 I^1 + 3 I^2.
Application of the algorithm of Appendix A gives
J_0^i = I^i, J_1^i, J_2^i = J^i, where

      J_1^1 = I^1 + I^2          J^1 = J_1^2 + 2 J_1^1
      J_1^2 = I^2                J^2 = J_1^1

If the dimensions of A are 20, 30, then we get from 3(a)

      \underline{δ}_1^1 = 2      \bar{δ}_1^1 = 50       δ_1^1 = 49
      \underline{δ}_1^2 = 1      \bar{δ}_1^2 = 30       δ_1^2 = 30
      \underline{δ}_2^1 = 3      \bar{δ}_2^1 = 128      δ_2^1 = 126
      \underline{δ}_2^2 = 1      \bar{δ}_2^2 = 49       δ_2^2 = 49
Thus, 49, 30 are the dimensions of A_1 and 126, 49 are the dimensions
of A_2 = Ã. (We assume the default allocation for A and Ã, so no
reallocation is necessary.) We can then translate (19) into:

          DO β1 I^2 = 1, 30
          DO β1 FOR ALL I^1 / [1 ... 20]
      β1  A_1(I^1 + I^2 - 1, I^2) = A(I^1, I^2)
          DO β2 J_1^2 = 1, 30
          DO β2 FOR ALL J_1^1 / [1 ... 49]
      β2  Ã(J_1^2 + 2 * J_1^1 - 2, J_1^1) = A_1(J_1^1, J_1^2)

Observe that these loops require a total of 60 ILLIAC
iterations, compared to 600 iterations if (19) were executed
sequentially. A 49 by 30 temporary array A_1 had to be introduced.
However, note that A_1 can be overlapped with any other similarly
introduced temporary array - i.e., with VAR_1 for any variable VAR.
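
The two-stage reformatting of this example can be simulated to check both the final placement of the elements and the iteration count. The Python sketch below is an illustration under stated assumptions: arrays are dictionaries keyed by 1-based subscripts, each pass of an outer loop is counted as one ILLIAC iteration because its DO FOR ALL involves at most 64 elements, and the offset of 2 in the second stage is assumed so that the first subscript of the result runs from 1 to 126, in agreement with the dimensions stated above.

```python
def staged_reformat(A):
    """Reformat a 20 by 30 array in two stages; return (result, ILLIAC iterations)."""
    iterations = 0

    # Stage 1: A_1(I1 + I2 - 1, I2) = A(I1, I2)
    A1 = {}
    for i2 in range(1, 31):              # sequential DO on I2
        iterations += 1
        for i1 in range(1, 21):          # DO FOR ALL I1, 20 elements at once
            A1[(i1 + i2 - 1, i2)] = A[(i1, i2)]

    # Stage 2: result(J2 + 2*J1 - 2, J1) = A_1(J1, J2)
    A_tilde = {}
    for j2 in range(1, 31):              # sequential DO on the second index
        iterations += 1
        for j1 in range(1, 50):          # DO FOR ALL, 49 elements at once
            if (j1, j2) in A1:           # unassigned elements of A_1 carry no data
                A_tilde[(j2 + 2 * j1 - 2, j1)] = A1[(j1, j2)]

    return A_tilde, iterations


A = {(i1, i2): 100 * i1 + i2 for i1 in range(1, 21) for i2 in range(1, 31)}
A_TILDE, COUNT = staged_reformat(A)
print(COUNT)
print(max(k[0] for k in A_TILDE), max(k[1] for k in A_TILDE))
```

Composing the two stages places A(I^1, I^2) at subscripts (2 I^1 + 3 I^2 - 4, I^1 + I^2 - 1) of the result, so every element ends on the hyperplane determined by π, shifted to start at 1.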

F.  Avoiding  Reformatting 

There is one case in which reformatting can be avoided altogether, namely the case in which the following hold:

(i) π satisfies rule C1, so the loop (13) can be rewritten with a DO CONC.

(ii) $a_i = 0$ for some i, where $a_i$ is as in (8) of Section I-VII.

An examination of the algorithm of Appendix A shows that (ii) implies that $J^i = I^i$, and in (18-a) and (18-b) we have $t^j = 0$ if $j \ne i$.* Therefore, using (18-b) to substitute for the $I^j$ in the loop body produces a valid DO FOR ALL $J^i$ loop body. Thus, we need only translate the DO CONC $(J^2, \ldots, J^n) \in S$ statements into a sequence of DOs followed by a DO FOR ALL $J^i$. This translation procedure is similar to the ones we have already performed, so a detailed algorithm is omitted.

If $a_{i_1} = \cdots = a_{i_k} = 0$, the above procedure can be generalized to yield a DO FOR ALL $(J^{i_1}, \ldots, J^{i_k})$ loop.

G. The Number of ILLIAC Iterations

The mapping π should really be chosen to minimize the total number of ILLIAC iterations. We can now write a formula for that number.

Let $\lceil a \rceil$ denote the smallest integer greater than or equal to a. From Subsection B, it is easy to see that the number of ILLIAC iterations needed to execute the loop is

    $(\mu^1 + 1) \lceil N \rceil$,

where $N = (\mu^2 + 1) \cdots (\mu^n + 1) / 64$ (assuming a 64 P.E. ILLIAC).

* We are again neglecting the first superscript permutation in the algorithm of Appendix A. The necessary modifications to the discussion are obvious. Note that $J^i$ will never be $J^1$.

The expression (9) of Section I-VII which we decided to minimize is actually equal to $\mu^1$. Since computation of $\mu^2, \ldots, \mu^n$ involved the variables $J^i$, it is clear that the $\mu^i$ cannot be written as a simple function of π when i > 1.

H.  An  Example 

It is of interest to see just what sort of loop is produced by these procedures. We consider the following simple relaxation loop:

      DO 77 I1 = 2, N
      DO 77 I2 = 3, M
      A(I1, I2) = .25 * (A(I1 + 1, I2) + A(I1 - 1, I2)
                       + A(I1, I2 + 1) + A(I1, I2 - 1))
77    CONTINUE

with A defined by

      DIMENSION A(35, 50) .

Application of the method used in the proof of the Hyperplane Theorem gives the optimal mapping $\pi : \mathbf{Z}^2 \to \mathbf{Z}$ defined by

    $\pi[(I^1, I^2)] = I^1 + I^2$.

The algorithm of Appendix A, with the addition made in Subsection E, gives

    $J^1 = I^1 + I^2$        $I^1 = J^2$
    $J^2 = I^1$              $I^2 = J^1 - J^2$

We now apply the algorithm of Subsection B to write the DO and DO FOR ALL statements. This requires first applying the algorithm of Subsection IV A for the case k = n = 2, to find the $\lambda^i$, $\mu^i$. Applying that algorithm gives the following results:

    $\lambda^1 = 1$        $\mu^1 = 35$
    $\lambda^2 = 1$        $\mu^2 = 50$

and rewrites the loop as

      DO 77 I1 = 1, N - 1
      DO 77 I2 = 1, M - 2
      A(I1 + 1, I2 + 2) = .25 * (A(I1 + 2, I2 + 2)
                        + A(I1, I2 + 2) + A(I1 + 1, I2 + 3)
                        + A(I1 + 1, I2 + 1))
77    CONTINUE .

Continuing the algorithm of Subsection B, we get

    $\lambda^1 = 2$    $\mu^1 = N + M - 3$    $t^1 = 0$    $\nu^1 = N + M - 4$
    $\lambda^2 = 1$    $\mu^2 = 35$           $t^2 = 1$    $\nu^2 = 35$

(18-a, b) become

    $J^1 = I^1 + I^2 - 1$        $I^1 = J^2$
    $J^2 = I^1$                  $I^2 = J^1 - J^2 + 1$

We then get the following loop control statements:

      DO 77 J1 = 1, N + M - 4
      S = { J2 ∈ [1 ... 35] : (1 .LE. J2) .AND.
            ((N - 1) .GE. J2) .AND. (1 .LE. (J1 - J2 + 1))
            .AND. ((M - 2) .GE. (J1 - J2 + 1)) }
      DO 77 FOR ALL J2 ∈ S .

We now apply the algorithm of Subsection C to the variable A. We then get

    subscript   minimum   maximum   extent
    first          2         85        84
    second         1         35        35

We have the following DO CONC loops inserted before and after the main loop, respectively:

      DO 177 CONC FOR (I1, I2) ∈ [1...35] .CROSS. [1...50]
177   Ã(I1 + I2 - 1, I1) = A(I1, I2)

      DO 277 CONC FOR (I1, I2) ∈ [1...35] .CROSS. [1...50]
277   A(I1, I2) = Ã(I1 + I2 - 1, I1) .

The body of the loop is rewritten as follows:

      Ã(J1 + 3, J2 + 1) = .25 * (Ã(J1 + 4, J2 + 2) + Ã(J1 + 2, J2)
                        + Ã(J1 + 4, J2 + 1) + Ã(J1 + 2, J2 + 1)) .

The DO CONC loops may be immediately translated into DO I1 / DO FOR ALL I2 loops. Combining all of this, we get the following rewriting of the loop:

      DO 177 I1 = 1, 35
      DO 177 FOR ALL I2 ∈ [1 ... 50]
177   Ã(I1 + I2 - 1, I1) = A(I1, I2)

      DO 77 J1 = 1, N + M - 4
      S = { J2 ∈ [1 ... 35] : (1 .LE. J2) .AND. ((N - 1) .GE. J2)
            .AND. (1 .LE. (J1 - J2 + 1)) .AND. ((M - 2) .GE.
            (J1 - J2 + 1)) }
      DO 77 FOR ALL J2 ∈ S
      Ã(J1 + 3, J2 + 1) = .25 * (Ã(J1 + 4, J2 + 2) + Ã(J1 + 2, J2)
                        + Ã(J1 + 4, J2 + 1) + Ã(J1 + 2, J2 + 1))
77    CONTINUE

      DO 277 I1 = 1, 35
      DO 277 FOR ALL I2 ∈ [1 ... 50]
277   A(I1, I2) = Ã(I1 + I2 - 1, I1)

Except for the redundant condition (1 .LE. J2), this is about as efficient a rewriting of the loop as we can expect. As mentioned in Subsection C, an algorithm to remove such conditions can be constructed.

Despite its complexity, the above loop will probably run about 6 times faster on the ILLIAC than the original loop. Furthermore, most such relaxation loops occur inside another loop which includes a convergence test. The reformatting loops could be placed outside this outer loop. We would then get a loop which runs about 9 times faster than the original.
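The claim that the hyperplane ordering preserves the results of the relaxation loop can be checked on a small grid. The sketch below is a hedged illustration, not the compiler's output: it works directly on A through the index substitution I1 = J2 + 1, I2 = J1 - J2 + 3 instead of building the reformatted array, it models a DO FOR ALL by evaluating every right-hand side before storing any result, and the helper names and test data are assumptions.

```python
def make_grid(n, m):
    """Arbitrary test data, indexed 1-based with room for A(N+1, *) and A(*, M+1)."""
    return [[float((3 * i + 7 * j) % 11) for j in range(m + 2)]
            for i in range(n + 2)]


def relax_sequential(a, n, m):
    """Original loop: DO 77 I1 = 2, N / DO 77 I2 = 3, M."""
    a = [row[:] for row in a]
    for i1 in range(2, n + 1):
        for i2 in range(3, m + 1):
            a[i1][i2] = 0.25 * (a[i1 + 1][i2] + a[i1 - 1][i2]
                                + a[i1][i2 + 1] + a[i1][i2 - 1])
    return a


def relax_hyperplane(a, n, m):
    """Sequential DO over J1, simultaneous DO FOR ALL over J2 in S."""
    a = [row[:] for row in a]
    for j1 in range(1, n + m - 3):                  # J1 = 1, ..., N + M - 4
        s = [j2 for j2 in range(1, n)               # 1 <= J2 <= N - 1
             if 1 <= j1 - j2 + 1 <= m - 2]
        updates = {}
        for j2 in s:                                # all right-hand sides first
            i1, i2 = j2 + 1, j1 - j2 + 3            # back to the original I1, I2
            updates[(i1, i2)] = 0.25 * (a[i1 + 1][i2] + a[i1 - 1][i2]
                                        + a[i1][i2 + 1] + a[i1][i2 - 1])
        for (i1, i2), value in updates.items():     # then all stores
            a[i1][i2] = value
    return a


grid = make_grid(6, 8)
same = relax_hyperplane(grid, 6, 8) == relax_sequential(grid, 6, 8)
```

For N = 6 and M = 8 the sequential loop visits 30 points one at a time, while the hyperplane form needs N + M - 4 = 10 steps of the outer DO.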

I.  Simultaneous  Application  of  the  Method 

The  Hyperplane  Method  requires  reformatting  all  the  variables 
in  the  loop,  which  introduces  a  large  overhead  cost.  One  way  to 
reduce  this  overhead  is  to  use  the  same  reformatted  variables  in  more 
than  one  loop. 

Suppose that there are several loops in the program which satisfy the hypothesis of the Hyperplane Theorem, and are all of the same dimension. (I.e., they have a common value of n.) If the same index variables $J^1, \ldots, J^n$ are used for all the loops then it may be possible to do the reformatting only once.

In fact, such a single n-tuple of index variables can be chosen which allows all the loops to be written with a DO FOR ALL $(J^2, \ldots, J^n)$. All we do is apply our method of choosing π to the sets ⟨f, g⟩ of all the loops taken together. Any π satisfying rule S1 will then work for all the loops. To find the optimal choice, assign to each loop a weighting factor proportional to both the execution time of the loop body and the frequency of execution of the entire loop. Then π is chosen to minimize the expression formed as follows: For each loop, take the expression (9) of Section I-VII and multiply it by the above weighting factor. Then, sum over all the loops.




I.  INTRODUCTION 


We  have  derived  methods  for  determining  sets  of  points ,  in  the 
control  index  space  of  a  nest  of  FORTRAN  DO  loops,  for  which  concurrent 
and  ILLIAC-simultaneous  execution  of  the  statements  of  the  loop  body  can 
be performed. These analysis methods rely on the subscript expressions
of  array  references  being  loop-constant  affine  transformations  of  the 
loop  index  variables  of  a  restricted  form  (see  Chapter  3  -  Section  I).  In 
cases  where  these  restrictions  are  not  met,  sets  of  loop  index  points 
for  concurrent  loop  body  execution  can  be  derived  by  a  simulation  of  the 
control  history  of  the  loop.  The  simulation  method  works  best  for  nests  of 
DO  loops  in  which  the  control  limit  parameters  (at  least  at  the  outermost 
level)  are  known  constants.  The  method  also  depends  on  the  array 
reference  subscript  transformations  being  dependent  only  on  quantities 
determined  within  the  loop  or  else  determinable  statically  from  the  rest  of 
the  program. 




II.  FIRST  EXAMPLE 

The  following  discussion  shows  an  example  of  a  very  simple  loop 
form  which  does  not  satisfy  the  requirements  for  the  analytic  methods  of 
parallelism  detection.  The  simulation  technique  is  applied  to  the  example 
and  to  a  generalization  of  it,  showing  results  in  determining  an  improvement 
in  execution  speeds  derived  from  potential  simultaneous  executions  of 
previously  separate  executions  of  the  loop  body. 

The  general  problem  of  this  example  can  be  stated  as:  "What  kind 
of  simultaneity  or  concurrent  execution  can  be  found  in  loops  of  the  form: 


Depending on the forms of the subscripts, the generating statement may access one cell of the array more than once (particularly if there are two generators, one in I and one in J and the set of I and J values has a non-empty intersection). Moreover, the generation-use or overwrite relations between the generation and use statements shift as the loop indices change values. (This is in marked contrast to the simple singly-subscripted linear forms in which the relations are constant over the whole history of the loop.)

Consider the following example, particularly with a view to changing the inner or outer loop to a DO FOR ALL loop. It is presented in terms of its running values when written in each form.


Original Loop (sequential)

      DO 1 J = 1, 3
      DO 2 I = 1, 3
      A(J) = B(J) + I
      C(J) = A(I)
2     CONTINUE
1     CONTINUE

Proposed Transformed Loop (simultaneous execution of previously "outer" loop)

      DO 2 I = 1, 3
      DO 1 FOR ALL J ∈ [1, 2, 3]
      A(J) = B(J) + I
      C(J) = A(I)
1     CONTINUE
2     CONTINUE


History of loop computation in both forms.

Original loop (sequential):

    I, J Value    Statement Executed
    1, 1          A(1) = B(1) + 1
                  C(1) = A(1) = B(1) + 1
    2, 1          A(1) = B(1) + 2
                  C(1) = Initial value of A(2)
    3, 1          A(1) = B(1) + 3
                  C(1) = Initial value of A(3)
    1, 2          A(2) = B(2) + 1
                  C(2) = A(1) = B(1) + 3
    2, 2          A(2) = B(2) + 2
                  C(2) = A(2) = B(2) + 2
    3, 2          A(2) = B(2) + 3
                  C(2) = Initial value of A(3)
    1, 3          A(3) = B(3) + 1
                  C(3) = A(1) = B(1) + 3
    2, 3          A(3) = B(3) + 2
                  C(3) = A(2) = B(2) + 3
    3, 3          A(3) = B(3) + 3
                  C(3) = A(3) = B(3) + 3

Proposed transformed loop (simultaneous in J):

    I, J Value    Statement Executed
    1, 1          A(1) = B(1) + 1
    1, 2          A(2) = B(2) + 1
    1, 3          A(3) = B(3) + 1
    . . . . . . . . . . . . . . . . . . .
    1, 1          C(1) = A(1) = B(1) + 1
    1, 2          C(2) = A(1) = B(1) + 1
    1, 3          C(3) = A(1) = B(1) + 1
    . . . . . . . . . . . . . . . . . . .
    2, 1          A(1) = B(1) + 2
    2, 2          A(2) = B(2) + 2
    2, 3          A(3) = B(3) + 2
    . . . . . . . . . . . . . . . . . . .
    2, 1          C(1) = A(2) = B(2) + 2
    2, 2          C(2) = A(2) = B(2) + 2
    2, 3          C(3) = A(2) = B(2) + 2
    . . . . . . . . . . . . . . . . . . .
    etc.

In the simultaneous J form, the horizontal dotted lines separate triples of statements which are executed simultaneously. That is, all three "A(J) = ..." statements are executed simultaneously for J = 1, J = 2, and J = 3. The syntax of the "DO FOR ALL" loop indicates that J is the "simultaneous" index and takes all the integral values 1, 2, and 3 during the execution of each statement. Each value is assigned to a different processor.

The proposed rewriting does not compute the same values as the original. In particular, the final C(1) and C(2) values should be the same as the initial A(3) value, but will be B(3) + 3.

In terms of the sequential loop

-  the generation of A(1) at (3, 1) (I = 3, J = 1) provides the value for the use of A(1) at (1, 2) across index set points,

-  the generation of A(2) at (2, 2) provides the value for the use of A(2) still within the same loop index point,

-  the use of A(3) at (3, 2) is the initial value of A(3) and so the generations of A(3) at (1, 3), (2, 3) and (3, 3) must not precede this use.

These three situations are not all satisfied by the proposed rewriting.
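The disagreement between the two forms can be reproduced by running both on sample data. This is a minimal sketch: the function names and the numeric inputs are invented for illustration, and the DO FOR ALL is modelled by completing each statement for every J before the next statement starts.

```python
def run_sequential(a_init, b, n=3):
    """DO 1 J / DO 2 I, with I as the inner loop."""
    a, c = dict(a_init), {}
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            a[j] = b[j] + i          # A(J) = B(J) + I
            c[j] = a[i]              # C(J) = A(I)
    return a, c


def run_for_all_j(a_init, b, n=3):
    """DO 2 I / DO 1 FOR ALL J, one statement at a time across all J."""
    a, c = dict(a_init), {}
    for i in range(1, n + 1):
        for j in range(1, n + 1):    # A(J) = B(J) + I for every J at once
            a[j] = b[j] + i
        a_i = a[i]
        for j in range(1, n + 1):    # C(J) = A(I) for every J at once
            c[j] = a_i
    return a, c


a_init = {1: 100, 2: 200, 3: 300}    # initial values of A (sample data)
b = {1: 10, 2: 20, 3: 30}
seq_a, seq_c = run_sequential(a_init, b)
par_a, par_c = run_for_all_j(a_init, b)
```

With these inputs the sequential loop leaves C(1) and C(2) equal to the initial A(3), while the transformed loop sets every C(J) to B(3) + 3; the final A values agree in both forms.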

Even  though  this  simple  rewriting  was  impossible,  that  does  not 
mean  that  there  is  no  inherent  concurrency  in  the  example. 

Consider  the  sequential  loop  example  expanded  to  I  =  1 ,  5  and 
J  =  1,  5  but  with  no  other  changes. 

The notation of the following matrix represents a history of simulation of the loop body in the order in which the loop indices are incremented:

[Diagram: an arrow labelled I]

means that the I loop is iterated before the J loop (the I loop is interior to the J loop).

The  ordered  pairs  of  values  in  the  matrix  stand  for  I,  J  pairs  at 
a  point  in  time. 

The arrows demonstrate generation-use relations in the simulation. They are derived by inspecting subscript forms on the generation and use statements, keeping in mind that a generator precedes in time a use of that value. Thus the arrows point from generations to uses. This inspection procedure could easily be mechanized.

The left hand margin nodes "A(1)", "A(2)", etc. are the initial values of A on entry to the loop. The lower margin nodes "A(1)", "A(2)", etc. are final values of A on exit from the loop.

[Figure: Generations and uses of the variable A for the 5 x 5 case derived from the example]

Since the array B is used only as an input to the loop (only in "uses") and the array C is used only as output of the loop (only in "generations"), the essential ordering information lies only within the uses and generations of A.

To the use-generation arcs can be added a set of arcs for "overwrite avoidance precedence". In this example there is in general more than one generating instance of a value. (The gen-use arcs link "latest" generation with "proper" use.) For instance, I, J points (1, 1), (2, 1), (3, 1), (4, 1) and (5, 1) all generate a value for A(1). Note that (1, 1) ... (4, 1) all precede (5, 1) in time and that (5, 1) generates a value which will be used by (1, 2), (1, 3), (1, 4) and (1, 5) and will be the output value for A(1).

To prevent overwrite of the A(1) value properly generated by (5, 1), arcs must be drawn from (1, 1) to (5, 1), (2, 1) to (5, 1), (3, 1) to (5, 1) and (4, 1) to (5, 1). This process, too, can be mechanized.
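One way to mechanize the derivation of both kinds of arcs is to replay the sequential loop while remembering which point last generated each cell of A. The following is a sketch under assumed names (`derive_arcs`, the `("I", k)` marker for an initial value); output-value arcs are left out for brevity.

```python
def derive_arcs(n):
    """Replay DO J / DO I (I innermost) and collect precedence arcs.

    Loop points are (i, j) tuples; ("I", k) stands for the initial value
    of A(k). Returns (gen_use, overwrite) as sets of (from, to) pairs.
    """
    last_writer = {}                  # cell k -> point that last generated A(k)
    writers = {}                      # cell k -> all generating points, in time order
    gen_use = set()
    for j in range(1, n + 1):
        for i in range(1, n + 1):
            point = (i, j)
            last_writer[j] = point                    # A(J) = B(J) + I
            writers.setdefault(j, []).append(point)
            source = last_writer.get(i, ("I", i))     # C(J) = A(I)
            if source != point:                       # self-loops are dropped
                gen_use.add((source, point))
    overwrite = set()
    for points in writers.values():
        final = points[-1]                            # the "latest" generation
        for earlier in points[:-1]:
            overwrite.add((earlier, final))           # must not follow the final one
    return gen_use, overwrite


gen_use, overwrite = derive_arcs(5)
```

For the 5 x 5 case this yields the arcs described in the text, for example (5, 1) to (1, 2) among the generation-use arcs and (1, 1) to (5, 1) among the overwrite-avoidance arcs.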

This  set  of  arcs  and  nodes  can  be  used  to  study  the  sets  of  nodes 
(or  values  of  I,  J)  for  which  the  body  of  the  loop  can  be  executed  con¬ 
currently.  An  incidence  matrix  is  formed  and  used  according  to  the  method 
of  Ramamoorthy  [  6,  particularly  pages  5-7  ]. 

In the partition list notation:

    I_n        is an input value node
    O_n        is an output value node
    n_1 n_2    is a point in the history of the loop, where n_1 n_2 stands for I = n_1, J = n_2.

In the list of arcs notation:

    the arcs are labelled according to the pair of nodes with head of arc preceding tail.


The set of arcs used to form the incidence matrix:

    11, 51        41, 51        I2, 21
    12, 52        42, 52        I3, 31
    13, 53        43, 53        I3, 32
    14, 54        44, 54        I4, 41
    15, 55        45, 55        I4, 42
    21, 51        51, 12        I4, 43
    22, 52        51, 13        I5, 51
    23, 53        51, 14        I5, 52
    24, 54        51, 15        I5, 53
    25, 55        52, 23        I5, 54
    31, 51        52, 24        51, O1
    32, 52        52, 25        52, O2
    33, 53        53, 34        53, O3
    34, 54        53, 35        54, O4
    35, 55        54, 45        55, O5

Note that I1 never appears because the initial value of A(1) is never used.

It is reasonable to add an assumption to the results of the method: that all the inputs precede any output. This is not strictly necessary in this case. Also note that "self-loops" like (11, 11), although indicating that A(1) is generated and then used within the point (1, 1), are not included in the incidence matrix. The aim is to derive sets of distinct points for which the loop body can be executed in parallel. For this 5 x 5 case the technique described by Ramamoorthy gives concurrency partitions ("earliest execution initiation") as:

    time    Set
    t1      (I1, I2, I3, I4, I5), 11, 22, 33, 44
    t2      21, 31, 32, 41, 42, 43
    t3      51
    t4      (O1), 12, 13, 14, 15
    t5      52
    t6      (O2), 23, 24, 25
    t7      53
    t8      (O3), 34, 35
    t9      54
    t10     (O4), 45
    t11     55
    t12     (O5)


The I and O events are "outside" the actual loop. One can assume all I's are done at entry and can assume there won't be any use of an O until the exit. Note that:

1) the t1 and t2 (non-I stuff) partitions can be merged, and

2) t12 really is an external time slice.

Thus the whole loop history is performable in 10 chunks instead of 25.
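The count of 10 chunks can be reproduced by assigning each loop point the earliest step at which all of its predecessors have finished. The sketch below is an illustration only: it is a plain longest-path levelling, not necessarily the exact incidence-matrix procedure of Ramamoorthy, the function name is hypothetical, and the input and output nodes are left out, which corresponds to merging t1 with t2 and dropping t12.

```python
def concurrency_partitions(n):
    """Group the n*n loop points (i, j) by earliest possible execution step."""
    points = [(i, j) for j in range(1, n + 1) for i in range(1, n + 1)]
    preds = {p: set() for p in points}
    for j in range(1, n + 1):
        for i in range(1, n):
            preds[(n, j)].add((i, j))      # overwrite-avoidance arc (i, j) -> (n, j)
        for k in range(j + 1, n + 1):
            preds[(j, k)].add((n, j))      # generation-use arc (n, j) -> (j, k)

    level = {}

    def earliest(p):
        # One step after the latest predecessor; step 1 if there is none.
        if p not in level:
            level[p] = 1 + max((earliest(q) for q in preds[p]), default=0)
        return level[p]

    for p in points:
        earliest(p)
    steps = max(level.values())
    return [sorted(p for p in points if level[p] == t)
            for t in range(1, steps + 1)]


partitions = concurrency_partitions(5)
```

Here `partitions[0]` holds the ten points that depend only on initial values or on themselves, and the remaining nine steps alternate between a single point (5, j) and the points that use the value it generates.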

The original loop for Max {[I, J]} = [5, 5] can now be rewritten to reflect the new time slices:

Note: The problems with syntax or semantics at t2 result from the values of A(1) - A(3) and C(1) - C(3) being over-written below. There is a generalization of this case (for N > 2) which shows just how much improvement in execution time may be possible. Ordinary Fortran:

      DO 1 J = 1, N
      DO 2 I = 1, N
      A(J) = B(J) + I                        use of both J and I in gen. of A(J)
      C(J) = A(I)                            use of A(I)
2     CONTINUE
1     CONTINUE

requires $N^2$ executions of body.

Rewritten:

      DO 1 FOR ALL J ∈ [1, 2, ..., N-1]      N-1 parallel executions, equiv. to
      A(J) = B(J) + J                        1 execution if ≥ N-1 processors
1     C(J) = A(J)                            available. Note change of variables.

      DO 2 I = 2, N - 1                      (N-2)(N-3)/2 parallel executions,
      DO 2 FOR ALL J ∈ [1, 2, ..., I-1]      equiv. to N-2 executions if ≥ N-2
      A(J) = B(J) + I                        processors available.
2     C(J) = A(I)

      DO 4 I = 1, N - 1                      N-1 executions. Note change of
      A(I) = B(I) + N                        variables and introduction of a
      C(I) = A(N)                            constant N.

      DO 3 FOR ALL J ∈ [I+1, I+2, ..., N]    (N-1)(N-2)/2 parallel executions,
      A(J) = B(J) + I                        equiv. to N-1 executions if ≥ N-2
3     C(J) = A(I)                            processors available.
4     CONTINUE

      A(N) = B(N) + N                        1 execution. This could have been in
      C(N) = A(N)                            "DO 4 I" loop if "DO 3 FOR ALL J" not
                                             executed at least once.

Total equivalent executions of "loop body" = 3N - 2.
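The total can be checked by adding the equivalent execution counts annotated on each piece of the rewritten loop, assuming enough processors are available for every DO FOR ALL:

```latex
\[
\underbrace{1}_{\text{DO 1}}
+ \underbrace{(N-2)}_{\text{DO 2}}
+ \underbrace{(N-1)}_{\text{DO 4 statements}}
+ \underbrace{(N-1)}_{\text{DO 3}}
+ \underbrace{1}_{\text{final pair}}
= 3N - 2 .
\]
```

For N = 5 this gives 13 equivalent executions against $N^2 = 25$ for the sequential loop.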
III. SECOND EXAMPLE


Another example of the use of a simulation technique demonstrates that there is often a considerable difference between the amount of concurrent execution that can be expressed with a "linearized" syntactic form such as a DO FOR ALL and the amount of concurrency inherent to a loop body. The example shows a form of "transpose mapping". The maximal concurrent set of (I, J) points is a triangular half of the N x N square. In the example, all the generations shown can be performed concurrently (there is no overwrite problem among generators) and then all the uses can be performed concurrently. A subset of these two sets is given by a sequential stepping of a "line" of concurrency parallel to the J axis in the diagram.

      DO 1 I = 1, N
      DO 1 J = 1, N
      A(I, J) = ---
1     --- = --- A(J, I) ---

The DO J loop can be transformed to a DO FOR ALL:

For I = J, intra-loop gen-use precedence.

For I < J, inter-loop generation-use precedence:

    Execution at $I_K < J_K$ is generator for use at $I_L > J_L$ (specifically $I_L = J_K$, $J_L = I_K$).

    (Relating K and L:
        Let  K = (I - 1) * N + J
        Then L = (J - I) * (N - 1) + K
               = (J - 1) * N + I .)

For I > J, inter-loop non-overwrite precedence:

    Use at $I_K > J_K$ is of an initial value of the variable. Generation at $I_L < J_L$ (where $I_L = J_K$, $J_L = I_K$ as before) is for an output or final variable which will not be reused by the loop.

Thus for a given $I_K$,

    $J^* = \{ J_{K-M}, J_{K-M+1}, \ldots, J_{K-1}, J_K, J_{K+1}, \ldots, J_{K+P} \}$

as an active set for parallel execution gives:

for $J_\alpha < I_K$:

    1) generate an output value which won't be used in the loop

    2) use a value generated when $I_L = J_\alpha$ (earlier in the I loop)

for $J_\alpha = I_K$:

    1) generate an output value

    2) use the value just generated (intra-loop)

for $J_\alpha > I_K$:

    1) generate an output value which will be used in the loop at the point that $I_L = J_\alpha$

    2) use an initial value

[Chart: generation-use arrows over the index set; $J_u$, $I_u$ are the index pair for the "--- = --- A(J, I) ---" statement]

Notes:

1) In the chart $I_g = I_u$, $J_g = J_u$ because what is being modelled is the I, J values resulting for the particular DO statements with their given ordering. (I.e., time is modelled.)

2) All arrows in the diagram point from generation to use.

3) The ordering of the values in the ordered pairs reflects the data point being referenced at a point in time.

It can be seen that all the generations in the "upper right triangular half" of the index set can be executed concurrently, then all the uses. Next the generations and then the uses of the "lower left triangular half" can be executed concurrently. This gives the maximum concurrency and the whole loop can be executed in 2 steps rather than the original 16. By comparison, as has been shown, a DO FOR ALL J rewriting is legal but provides a concurrency improvement factor of only 4 instead of 8. The derivation of generation-use dependency and the sets of index set points for concurrent execution is mechanizable as described in the previous example.
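Both schedules can be compared against the sequential loop on a 4 x 4 case. This is a sketch under stated assumptions: the elided statements are filled in with a hypothetical generation A(I, J) = B(I, J) + 1 and a hypothetical use C(I, J) = A(J, I), the diagonal points are placed in the upper half, and indices are 0-based.

```python
def gen(b, i, j):
    """Hypothetical right-hand side for the elided 'A(I, J) = ---' statement."""
    return b[i][j] + 1


def sequential(a_init, b, n):
    a = [row[:] for row in a_init]
    c = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a[i][j] = gen(b, i, j)
            c[i][j] = a[j][i]            # use of the transposed element
    return a, c


def for_all_j(a_init, b, n):
    """Sequential I; each statement completes for every J before the next starts."""
    a = [row[:] for row in a_init]
    c = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            a[i][j] = gen(b, i, j)
        for j in range(n):
            c[i][j] = a[j][i]
    return a, c


def two_steps(a_init, b, n):
    """Upper-right half (with the diagonal) first, then the lower-left half."""
    a = [row[:] for row in a_init]
    c = [[None] * n for _ in range(n)]
    upper = [(i, j) for i in range(n) for j in range(n) if i <= j]
    lower = [(i, j) for i in range(n) for j in range(n) if i > j]
    for half in (upper, lower):
        for i, j in half:                # all generations of this half
            a[i][j] = gen(b, i, j)
        for i, j in half:                # then all uses of this half
            c[i][j] = a[j][i]
    return a, c


n = 4
a_init = [[100 * i + j for j in range(n)] for i in range(n)]
b = [[i - 2 * j for j in range(n)] for i in range(n)]
reference = sequential(a_init, b, n)
```

All three versions produce the same A and C for this body; the agreement depends on the generation reading nothing from A, which is an assumption of the sketch.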
LINEAR SUBSCRIPT EXPRESSIONS

The algebraic loop analysis methods depend on the subscript expressions within array references being linearly dependent on the loop index variables. Given:


determining if the subscript of A is a linear function of I involves "back-substituting" for all variables in the subscript. The algorithm is obvious, though painstaking, and involves algebra on a canonical form

    $m = \left( \sum_i a_i j_i \right) + a_0$

where the $j_i$ are the variables of the DO statements of a nest and $\{a_i\}$, $a_0$ are arbitrary expressions not involving these variables. If any chain of back substitution cannot be put into the canonical form, the subscript is not linear.

A  complexity  arises  from  the  following  example: 

      N = 0
      DO 10 I = 1, 100
      N = N + 5
      A(I) = B(N) + C(I)
10    CONTINUE

which is equivalent, in effect, to:

      DO 10 I = 1, 100
      A(I) = B(5 * I) + C(I)
10    CONTINUE

Note that this is the essential truth behind the sequential machine optimization method known as "reduction in operation strength". What we want to do here, however, is the opposite transformation, possibly described as "solution of recursion relations". For recursions of the form n = n + a, the solution will be linear in the loop index variable.
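The equivalence of the recursion and its linear solution can be checked directly. The sketch below assumes a unit loop increment; the function and parameter names are illustrative and do not come from the text.

```python
def subscripts_by_recurrence(n0, step, lo, hi):
    """Values taken by N inside 'DO I = lo, hi' when the body starts with N = N + step."""
    n = n0
    values = []
    for _ in range(lo, hi + 1):
        n = n + step
        values.append(n)
    return values


def subscripts_closed_form(n0, step, lo, hi):
    """Linear solution of the recursion: N = (I - lo + 1) * step + n0."""
    return [(i - lo + 1) * step + n0 for i in range(lo, hi + 1)]


by_recurrence = subscripts_by_recurrence(0, 5, 1, 100)
by_formula = subscripts_closed_form(0, 5, 1, 100)
```

For the loop in the text (initial N of 0, increment 5, I from 1 to 100) both lists equal 5 * I, so the subscript of B is linear in I and the analytic methods apply.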

Two  more  general  examples  are: 


■ ;  N  =  n 


N  =  N  +  K 


DC 

^ — ' 

2 

o 

£ 

ii 

N  = 

K 

ma 

I  +  (K  +  N 

A 


M. 


NT  K 
o 


)  = 


I  -  M 
_ o 


+  1)  K  +  N 

o 


Note  that  the  end  product  of  the  transformation  cannot  be  expressed  in 
integer  FORTRAN  variables  and  operations ,  in  general ,  because  the  results 
of  the  divisions  are  not  likely  to  yield  integers  although  the  final  form 
will  yield  integer  results. 


Mq  =  !,  Nq  =  °,  K  =  3,  =  2> 


IW. 


(viz. 


(2) 
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